Note: Some of the code has been omitted to enhance readability.
The following terms might be used interchangeably with each other:
Rows = Observations = Records
Columns = Variables = Features
Purpose and Objectives:
The purpose of this project is to showcase my knowledge of Data Analytics, focusing on the application of machine learning techniques to text classification. My objective is to predict the sentiment (positive, neutral, or negative) of individuals from their Twitter messages about COVID-19, using only the words that appear 100 times or more in the training data. The model used here is Multinomial Naive Bayes.
Data Collection
Loading the data:
training <- read.csv(file = "Corona_NLP_train.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE, strip.white = TRUE, encoding = "latin1")
#encoding = "UTF-8"
testing <- read.csv(file = "Corona_NLP_test.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE, strip.white = TRUE, encoding = "latin1")
Data Exploration
Training data set
Displaying the structure of the Training data set:
'data.frame': 41157 obs. of 6 variables:
$ UserName : int 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 ...
$ ScreenName : int 48751 48752 48753 48754 48755 48756 48757 48758 48759 48760 ...
$ Location : chr "London" "UK" "Vagabonds" "" ...
$ TweetAt : chr "16-03-2020" "16-03-2020" "16-03-2020" "16-03-2020" ...
$ OriginalTweet: chr "@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8" "advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neigh"| __truncated__ "Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak htt"| __truncated__ "My food stock is not the only one which is empty...\n\n\n\n\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD F"| __truncated__ ...
$ Sentiment : chr "Neutral" "Positive" "Positive" "Positive" ...
Identifying missing values (NAs) in the Training data set:
UserName ScreenName Location TweetAt OriginalTweet Sentiment
0 0 0 0 0 0
Based on the NA counts per column, there are no missing values (NAs) in the Training data frame.
Displaying the content of the first two observations of the OriginalTweet variable within the Training data set:
[1] "@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8"
[1] "advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order"
Displaying the frequency and the distribution of the Sentiment variable within the Training data set:
Extremely Negative Extremely Positive Negative Neutral Positive
5481 6624 9917 7713 11422


Testing data set
Displaying the structure of the Testing data set:
'data.frame': 3798 obs. of 6 variables:
$ UserName : int 1 2 3 4 5 6 7 8 9 10 ...
$ ScreenName : int 44953 44954 44955 44956 44957 44958 44959 44960 44961 44962 ...
$ Location : chr "NYC" "Seattle, WA" "" "Chicagoland" ...
$ TweetAt : chr "02-03-2020" "02-03-2020" "02-03-2020" "02-03-2020" ...
$ OriginalTweet: chr "TRENDING: New Yorkers encounter empty supermarket shelves (pictured, Wegmans in Brooklyn), sold-out online groc"| __truncated__ "When I couldn't find hand sanitizer at Fred Meyer, I turned to #Amazon. But $114.97 for a 2 pack of Purell??!!C"| __truncated__ "Find out how you can protect yourself and loved ones from #coronavirus. ?" "#Panic buying hits #NewYork City as anxious shoppers stock up on food&medical supplies after #healthcare wo"| __truncated__ ...
$ Sentiment : chr "Extremely Negative" "Positive" "Extremely Positive" "Negative" ...
Identifying missing values (NAs) in the Testing data set:
UserName ScreenName Location TweetAt OriginalTweet Sentiment
0 0 0 0 0 0
Based on the NA counts per column, there are also no missing values (NAs) in the Testing data frame.
Displaying the content of the first two observations of the OriginalTweet variable within the Testing data set:
[1] "TRENDING: New Yorkers encounter empty supermarket shelves (pictured, Wegmans in Brooklyn), sold-out online grocers (FoodKick, MaxDelivery) as #coronavirus-fearing shoppers stock up https://t.co/Gr76pcrLWh https://t.co/ivMKMsqdT1"
[1] "When I couldn't find hand sanitizer at Fred Meyer, I turned to #Amazon. But $114.97 for a 2 pack of Purell??!!Check out how #coronavirus concerns are driving up prices. https://t.co/ygbipBflMY"
Displaying the frequency and the distribution of the Sentiment variable within the Testing data set:
Extremely Negative Extremely Positive Negative Neutral Positive
592 599 1041 619 947


Data Preprocessing
Training data set
Replacing the entries labeled as Extremely Positive with Positive and the entries labeled as Extremely Negative with Negative of the Sentiment variable in the Training data set:
training$Sentiment[training$Sentiment == "Extremely Positive"] <- "Positive"
training$Sentiment[training$Sentiment == "Extremely Negative"] <- "Negative"
Displaying the frequency and the distribution of the Sentiment variable within the Training data set:
Negative Neutral Positive
15398 7713 18046


Factorizing the Sentiment variable, assigning numeric values to the resulting factors, and adding them to the Training data set as a new column named Labels:
training["Labels"] <- as.integer(as.factor(training$Sentiment)) - 1
Displaying the frequency of the Labels variable within the Training data set:
0 1 2
15398 7713 18046
The frequency counts of the Labels variable exactly match those of the Sentiment column in the Training data set.
Cleaning and transforming the text of the OriginalTweet variable in the Training data set:
training$OriginalTweet <- replace_non_ascii(training$OriginalTweet)
training$OriginalTweet <- tolower(training$OriginalTweet)
training$OriginalTweet <- replace_date(training$OriginalTweet)
training$OriginalTweet <- replace_time(training$OriginalTweet)
training$OriginalTweet <- replace_money(training$OriginalTweet)
training$OriginalTweet <- replace_number(training$OriginalTweet)
training$OriginalTweet <- replace_contraction(training$OriginalTweet, ignore.case = TRUE)
training$OriginalTweet <- replace_grade(training$OriginalTweet)
training$OriginalTweet <- replace_email(training$OriginalTweet)
training$OriginalTweet <- replace_hash(training$OriginalTweet)
training$OriginalTweet <- replace_tag(training$OriginalTweet)
training$OriginalTweet <- replace_html(training$OriginalTweet)
training$OriginalTweet <- replace_url(training$OriginalTweet)
training$OriginalTweet <- replace_internet_slang(training$OriginalTweet, ignore.case = TRUE)
training$OriginalTweet <- replace_emoji(training$OriginalTweet)
training$OriginalTweet <- replace_emoticon(training$OriginalTweet)
training$OriginalTweet <- removeWords(training$OriginalTweet, stopwords("english"))
training$OriginalTweet <- replace_symbol(training$OriginalTweet)
training$OriginalTweet <- mgsub(training$OriginalTweet, "'", "")
training$OriginalTweet <- strip(training$OriginalTweet)
training$OriginalTweet <- replace_kern(training$OriginalTweet)
training$OriginalTweet <- wordStem(training$OriginalTweet, language = "english")
training$OriginalTweet <- replace_white(training$OriginalTweet)
training <- training[!(training$OriginalTweet== ""), ]
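One detail of the cleaning step is worth illustrating. `wordStem()` expects a vector of single words, so when it receives a whole tweet it appears to treat the tweet as one long word and only affects its ending (compare "ord" and "saf" at the end of the cleaned tweets shown further down, while earlier words such as "numbers" stay unstemmed). The sketch below shows one possible way to stem every word instead. The helper name `stem_each_word` and the toy stemmer are made up for illustration; with SnowballC installed, `wordStem()` could be passed in as the stemming function. Applying this to the real data would change the vocabulary and therefore the results reported in this notebook.

```r
# Hypothetical helper: split each text into words, stem every word, and glue the words back together.
stem_each_word <- function(text, stem_fun) {
  vapply(strsplit(text, "\\s+"), function(tokens) {
    paste(stem_fun(tokens), collapse = " ")
  }, character(1))
}

# Toy stemmer used only for this illustration: drops one trailing "s" from each word.
toy_stem <- function(words) sub("s$", "", words)

tweets <- c("numbers neighbours schools", "stay calm")

# Stemming the whole string only touches the end of the text:
whole_string <- toy_stem(tweets)

# Stemming word by word touches every word:
per_word <- stem_each_word(tweets, toy_stem)

whole_string
per_word

# With SnowballC installed, the same helper could be used as follows (not run here):
# stem_each_word(tweets, function(w) SnowballC::wordStem(w, language = "english"))
```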
Setting up a Vector Source to treat each element of the OriginalTweet variable in the Training data set as an individual document:
training.vs <- VectorSource(training$OriginalTweet)
Displaying the structure of the Training vector source:
Classes 'VectorSource', 'SimpleSource', 'Source' hidden list of 5
$ encoding: chr ""
$ length : int 41119
$ position: num 0
$ reader :function (elem, language, id)
$ content : chr [1:41119] "advice talk neighbours family exchange phone numbers create contact list phone numbers neighbours schools emplo"| __truncated__ "coronavirus australia woolworths give elderly disabled dedicated shopping hours amid covid outbreak" "food stock one empty please panic will enough food everyone take need stay calm stay saf" "ready go supermarket outbreak i paranoid food stock litteraly empty serious thing please panic causes shortag" ...
Creating a Volatile Corpus from the Training vector source:
training.vc <- VCorpus(training.vs)
Displaying details about the Volatile Corpus:
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 41119
Displaying details of the first two documents within the Training Volatile Corpus:
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 194
[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 99
Displaying the content of the first two documents within the Training Volatile Corpus:
[1] "advice talk neighbours family exchange phone numbers create contact list phone numbers neighbours schools employer chemist gp set online shopping accounts poss adequate supplies regular meds ord"
[1] "coronavirus australia woolworths give elderly disabled dedicated shopping hours amid covid outbreak"
Displaying a Wordcloud of the top 100 most frequent words:

Displaying a list of the top 10 most frequent words:
Displaying a Wordcloud of the top 100 most frequent words of the positive sentiments:

Displaying a list of the top 10 most frequent words of the positive sentiments:
Generating a Document-Term Matrix from the Training Volatile Corpus to depict the frequency of terms within each document in the collection:
dtm <- DocumentTermMatrix(training.vc)
Inspecting the Document-Term Matrix in a human-readable format. The sample shows terms and their frequencies within a few documents of the collection:
tm::inspect(dtm)
<<DocumentTermMatrix (documents: 41119, terms: 37869)>>
Non-/sparse entries: 638798/1556496613
Sparsity : 100%
Maximal term length: 127
Weighting : term frequency (tf)
Sample :
Terms
Docs can consumer covid food grocery people prices store supermarket will
18898 0 0 0 0 0 0 1 0 0 1
26959 0 0 0 0 0 1 1 0 0 0
28881 0 0 1 0 0 0 1 0 0 0
31912 0 0 1 0 0 0 1 0 0 0
32027 0 0 1 0 0 0 1 0 0 0
32028 0 0 1 0 0 0 1 0 0 0
33965 0 0 1 1 0 0 0 0 0 0
34261 0 0 0 0 0 0 0 0 0 0
37744 0 0 0 0 0 0 0 0 0 0
9770 0 0 0 0 0 0 1 0 0 0
Showing the number of terms that are repeated 100 times or more in the Document-Term Matrix from the Training data set:
# Gets a list of those terms repeated 100 times or more from the Document-Term Matrix:
high.freq.terms <- findFreqTerms(dtm, lowfreq = 100)
# Display the quantity of those terms repeated 100 times or more:
length(high.freq.terms)
[1] 1121
Displaying a list of terms repeated 100 times or more in the Document-Term Matrix from the Training data set:
Showing the number of terms that are repeated fewer than 100 times in the Document-Term Matrix from the Training data set:
[1] 36748
Displaying a list of terms repeated fewer than 100 times (with a lower frequency bound of 90) in the Document-Term Matrix from the Training data set:
Removing those terms repeated fewer than 100 times from the Document-Term Matrix from the Training data set:
dtm <- dtm[, colnames(dtm) %in% high.freq.terms]
Confirming there are no terms repeated fewer than 100 times in the Document-Term Matrix from the Training data set:
cat("Terms repeated less than 100 times: ", length(findFreqTerms(dtm, highfreq = 99)))
Terms repeated less than 100 times: 0
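For intuition, this filter amounts to keeping only the columns of the matrix whose total count across all documents reaches the threshold. As a consistency check, the 1121 terms kept and the 36748 terms removed add up to the 37869 terms of the full matrix. The sketch below reproduces the idea in base R on a small made-up matrix, with a threshold of 3 instead of 100; the object names are illustrative and not part of the notebook.

```r
# Toy document-term counts (made up): 3 documents, 3 terms.
m <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    3, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("1", "2", "3"), Terms = c("covid", "panic", "store"))
)

# Threshold used for this toy example (the notebook uses 100).
min_count <- 3

# Total number of times each term appears across all documents.
term_totals <- colSums(m)

# Terms that reach the threshold, and the matrix restricted to them.
keep <- names(term_totals)[term_totals >= min_count]
m_small <- m[, keep, drop = FALSE]

term_totals
keep
```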
Testing data set
Replacing the entries labeled as Extremely Positive with Positive and the entries labeled as Extremely Negative with Negative of the Sentiment variable in the Testing data set:
testing$Sentiment[testing$Sentiment == "Extremely Positive"] <- "Positive"
testing$Sentiment[testing$Sentiment == "Extremely Negative"] <- "Negative"
Displaying the frequency and the distribution of the Sentiment variable within the Testing data set:
Negative Neutral Positive
1633 619 1546


Factorizing the Sentiment variable, assigning numeric values to the resulting factors, and adding them to the Testing data set as a new column named Labels:
testing["Labels"] <- as.integer(as.factor(testing$Sentiment)) - 1
Displaying the frequency of the Labels variable within the Testing data set:
0 1 2
1633 619 1546
The frequency counts of the Labels variable exactly match those of the Sentiment column in the Testing data set.
Cleaning and transforming the text of the OriginalTweet variable in the Testing data set:
testing$OriginalTweet <- replace_non_ascii(testing$OriginalTweet)
testing$OriginalTweet <- tolower(testing$OriginalTweet)
testing$OriginalTweet <- replace_date(testing$OriginalTweet)
testing$OriginalTweet <- replace_time(testing$OriginalTweet)
testing$OriginalTweet <- replace_money(testing$OriginalTweet)
testing$OriginalTweet <- replace_number(testing$OriginalTweet)
testing$OriginalTweet <- replace_contraction(testing$OriginalTweet, ignore.case = TRUE)
testing$OriginalTweet <- replace_grade(testing$OriginalTweet)
testing$OriginalTweet <- replace_email(testing$OriginalTweet)
testing$OriginalTweet <- replace_hash(testing$OriginalTweet)
testing$OriginalTweet <- replace_tag(testing$OriginalTweet)
testing$OriginalTweet <- replace_html(testing$OriginalTweet)
testing$OriginalTweet <- replace_url(testing$OriginalTweet)
testing$OriginalTweet <- replace_internet_slang(testing$OriginalTweet, ignore.case = TRUE)
testing$OriginalTweet <- replace_emoji(testing$OriginalTweet)
testing$OriginalTweet <- replace_emoticon(testing$OriginalTweet)
testing$OriginalTweet <- removeWords(testing$OriginalTweet, stopwords("english"))
testing$OriginalTweet <- replace_symbol(testing$OriginalTweet)
testing$OriginalTweet <- mgsub(testing$OriginalTweet, "'", "")
testing$OriginalTweet <- strip(testing$OriginalTweet)
testing$OriginalTweet <- replace_kern(testing$OriginalTweet)
testing$OriginalTweet <- wordStem(testing$OriginalTweet, language = "english")
testing$OriginalTweet <- replace_white(testing$OriginalTweet)
testing <- testing[!(testing$OriginalTweet== ""), ]
Setting up a Vector Source to treat each element of the OriginalTweet variable in the Testing data set as an individual document:
testing.vs <- VectorSource(testing$OriginalTweet)
Creating a Volatile Corpus from the Testing vector source:
testing.vc <- VCorpus(testing.vs)
Generating a Document-Term Matrix from the Testing Volatile Corpus, using the terms of the Training Document-Term Matrix as the dictionary. This ensures that the Testing Document-Term Matrix contains exactly the terms found in the Training Document-Term Matrix:
test_dtm <- DocumentTermMatrix(testing.vc, control = list(dictionary = Terms(dtm)))
Showing the count of documents and terms in both Document-Term Matrices (the number of terms should be the same):
DTM Documents Terms
[Training] 41119 1121
[Testing] 3797 1121
>>> The number of terms are the same!!!!!
Presenting the lists of terms contained within both Document-Term Matrices (the terms should be the same):
>>> The terms are identical!!!!!
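The effect of building the Testing matrix against the Training terms can be pictured as follows: terms seen only in the Testing data are dropped, and Training terms that never occur in the Testing data become columns of zeros. The base-R sketch below shows that alignment on made-up counts. The helper `align_to_vocab` is hypothetical and is not how `tm` implements the `dictionary` option.

```r
# Hypothetical helper: rebuild a count matrix so its columns are exactly `vocab`, in that order.
align_to_vocab <- function(m, vocab) {
  out <- matrix(0, nrow = nrow(m), ncol = length(vocab),
                dimnames = list(rownames(m), vocab))
  # Copy over only the terms that both matrices share.
  shared <- intersect(vocab, colnames(m))
  out[, shared] <- m[, shared, drop = FALSE]
  out
}

# Made-up training vocabulary and testing counts.
train_vocab <- c("covid", "food", "store")
test_counts <- matrix(
  c(1, 2,
    0, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1", "2"), c("covid", "toilet"))
)

aligned <- align_to_vocab(test_counts, train_vocab)
aligned
```

In this toy case "toilet" disappears because it is not in the training vocabulary, while "food" and "store" are added as all-zero columns.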
Model Training and Evaluation:
Training a Multinomial Naive Bayes model on the training data:
# Train the Multinomial Naive Bayes model on the training data.
multinomial.naive.bayes.model <- multinomial_naive_bayes(as.matrix(dtm), as.factor(training$Labels), laplace = 1)
Getting the predictions of the model on the testing data:
# Get the predictions of the Multinomial Naive Bayes model on the testing data.
multinomial.naive.bayes.predictions <- predict(multinomial.naive.bayes.model, newdata = as.matrix(test_dtm))
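For intuition about what the model computes, here is a minimal base-R sketch of Multinomial Naive Bayes with Laplace smoothing. It is not the `naivebayes` implementation: the function names `fit_mnb` and `predict_mnb`, the class names, and the toy counts are all made up for illustration.

```r
# Hypothetical helper: estimate log priors and smoothed log term probabilities per class.
fit_mnb <- function(x, y, laplace = 1) {
  y <- as.factor(y)
  # Total count of each term within each class (rows = classes, in level order).
  counts <- rowsum(x, group = y)
  # Laplace-smoothed P(term | class); each row sums to 1.
  probs <- (counts + laplace) / (rowSums(counts) + laplace * ncol(x))
  list(
    levels    = levels(y),
    log_prior = log(as.numeric(table(y)) / length(y)),
    log_probs = log(probs)
  )
}

# Hypothetical helper: choose the class with the highest log posterior score.
predict_mnb <- function(model, newx) {
  scores <- newx %*% t(model$log_probs)             # documents x classes
  scores <- sweep(scores, 2, model$log_prior, "+")  # add each class's log prior
  model$levels[max.col(scores, ties.method = "first")]
}

# Toy term counts (made up): 4 training documents, 3 terms.
x <- matrix(
  c(2, 0, 1,
    1, 0, 0,
    0, 2, 1,
    0, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("good", "bad", "food"))
)
y <- c("pos", "pos", "neg", "neg")

model <- fit_mnb(x, y, laplace = 1)

# Two new toy documents: one mentions "good" and "food", the other only "bad".
newx <- matrix(
  c(1, 0, 1,
    0, 1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("good", "bad", "food"))
)
predictions <- predict_mnb(model, newx)
predictions
```

Working in log space avoids multiplying many small probabilities, and the smoothing term keeps a word that never appeared in a class from forcing that class's score to minus infinity.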
Displaying the accuracy of the model:
Accuracy: 64.52%
Showing the count of the model's incorrect predictions:
Amount of incorrect predictions: 1347
Displaying a list of the actual labels next to the model's incorrectly predicted labels:
Presenting a frequency table of the actual labels of the Testing data set:
Actual labels table:
0 1 2
1633 618 1546
Showing a frequency table of the model's predicted labels:
Predictions labels table:
multinomial.naive.bayes.predictions
0 1 2
1613 602 1582
Displaying a Confusion Matrix of the actual labels of the Testing data set vs. the model's predicted labels:
Confusion Matrix:
multinomial.naive.bayes.predictions
(Pred Neg) 0 (Pred Neu) 1 (Pred Pos) 2
(True Neg) 0 1106 194 333
(True Neu) 1 181 266 171
(True Pos) 2 326 142 1078
Presenting a Proportion Confusion Matrix, which expresses each cell of the Confusion Matrix as a share of all predictions made on the Testing data set:
Proportion Confusion Matrix:
multinomial.naive.bayes.predictions
(Pred Neg) 0 (Pred Neu) 1 (Pred Pos) 2
(True Neg) 0 0.29128259 0.05109297 0.08770082
(True Neu) 1 0.04766921 0.07005531 0.04503555
(True Pos) 2 0.08585726 0.03739795 0.28390835
Based on the Confusion Matrices, 2450 of the predictions were correctly labeled (1106 + 266 + 1078, the diagonal of the matrix), accounting for 64.52% of all predictions. These findings corroborate the earlier results: subtracting 2450 from the total number of predictions (3797) yields 1347, the number of incorrect predictions calculated previously, and the percentage obtained from the Confusion Matrices precisely matches the accuracy percentage of the model (without rounding).
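The arithmetic can be re-checked directly from the Confusion Matrix. The sketch below re-enters the counts shown above by hand (the object names are illustrative) and recomputes the number of correct predictions, the accuracy, and the proportion matrix.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted).
cm <- matrix(
  c(1106, 194,  333,
     181, 266,  171,
     326, 142, 1078),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = c("0", "1", "2"), predicted = c("0", "1", "2"))
)

correct   <- sum(diag(cm))     # predictions on the diagonal are correct
total     <- sum(cm)           # all predictions
incorrect <- total - correct
accuracy  <- correct / total

# Each cell as a share of all predictions (the Proportion Confusion Matrix).
proportions <- prop.table(cm)

c(correct = correct, total = total, incorrect = incorrect)
round(100 * accuracy, 2)
```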
Among the incorrect predictions, the most common errors were confusions between Negative and Positive. Specifically, 333 Negative values were labeled as Positive, constituting approximately 8.77% of all predictions, closely followed by 326 Positive values labeled as Negative, representing around 8.59% of all predictions.
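One way to see which confusions dominate is to list the off-diagonal cells of the Confusion Matrix from largest to smallest. The sketch below does this with the counts re-entered by hand from the output above; the object names are illustrative.

```r
# Confusion matrix copied from the output above (0 = Negative, 1 = Neutral, 2 = Positive).
sentiments <- c("Negative", "Neutral", "Positive")
cm <- matrix(
  c(1106, 194,  333,
     181, 266,  171,
     326, 142, 1078),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = sentiments, predicted = sentiments)
)

# Keep only the errors by zeroing the diagonal, then flatten to one row per cell.
errors <- cm
diag(errors) <- 0
error_list <- as.data.frame(as.table(errors))
error_list <- error_list[error_list$Freq > 0, ]

# Sort from most to least frequent and add each error's share of all predictions.
error_list <- error_list[order(-error_list$Freq), ]
error_list$PercentOfAll <- round(100 * error_list$Freq / sum(cm), 2)
error_list
```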
Note: Use the following table to interpret the Confusion Matrices.
True-Negative False-Neutral False-Positive
False-Negative True-Neutral False-Positive
False-Negative False-Neutral True-Positive
---
title: <font color="black">Text Classification and Sentiment Analysis</font>
author: "Authored by <strong>Eloy Alvin Luna</strong>"
date: "Created on: January 26, 2024 - Last updated on: March 7, 2024"
output: html_notebook
---

```{r echo = FALSE}
# For the 3D Exploded Pie Chart:
if (!requireNamespace("plotrix", quietly = TRUE)) {
  install.packages("plotrix")
}

# For str_extract_all:
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}

# A dependency of the wordclouds:
if (!requireNamespace("RColorBrewer", quietly = TRUE)) {
  install.packages("RColorBrewer")
}

# To create the wordclouds:
if (!requireNamespace("wordcloud", quietly = TRUE)) {
  install.packages("wordcloud")
}

# For wordStem() - stemming transformation for Text Mining:
# https://www.rdocumentation.org/packages/SnowballC/versions/0.7.1/topics/wordStem
if (!requireNamespace("SnowballC", quietly = TRUE)) {
  install.packages("SnowballC")
}

# For replace_non_ascii()
# https://github.com/trinker/textclean
if (!requireNamespace("textclean", quietly = TRUE)) {
  install.packages("textclean")
}

# For check_text()
if (!requireNamespace("hunspell", quietly = TRUE)) {
  install.packages("hunspell")
}

# Natural Language Processing: Requirement of tm:
if (!requireNamespace("NLP", quietly = TRUE)) {
  install.packages("NLP")
}

# Text Mining: For VectorSource(), VCorpus(), and DocumentTermMatrix():
if (!requireNamespace("tm", quietly = TRUE)) {
  install.packages("tm")
}

# For Multinomial Naive Bayes
if (!requireNamespace("naivebayes", quietly = TRUE)) {
  install.packages("naivebayes")
}
```

<br>
<br>
<font color="red"><strong>Note: </strong>Some of the code has been omitted to enhance readability.</font>

<div style="color:black;background-color:lightyellow">
   <font size="2"><span style="color: darkblue">The following terms might be used interchangeably with each other:</span></font>
   <br>
   <br>
   <font size="2"><span style="color: darkblue">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>Rows</strong> = Observations = Records</span></font>
   <br>
   <font size="2"><span style="color: darkblue">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong>Columns</strong> = Variables = Features</span></font>
</div>
<br>
<br>
<font size="+2"><span style="color: navy;"><strong>Purpose and Objectives:</strong></span></font>
<br>
<p>The purpose of this project is to showcase my knowledge of <strong>Data Analytics</strong>, focusing on the application of machine learning techniques to <strong>text classification</strong>. My objective is to <strong>predict the sentiment</strong> (positive, neutral, or negative) of individuals from their Twitter messages about COVID-19, using only the words that appear <strong>100 times or more</strong> in the training data. The model used here is <strong>Multinomial Naive Bayes</strong>.</p>
<br>
<br>
<font size="+2"><span style="color: navy;"><strong>Data Collection</strong></span></font>
<br>
<ul>
   <li>The data was obtained from https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification?resource=download on January 26, 2024.</li>
   <li>Two files were collected with the names: "Corona_NLP_train.csv" and "Corona_NLP_test.csv"</li>
   <li>The files are in the <strong>comma-separated values (csv)</strong> format.</li>
</ul>
<br>
<font size="+0" color="black"><strong>Loading the data:</strong></font>

<!-- In this area, I'm using the variable names "training" and "testing" to hold the loaded data of the two data sets in two data frames. These two names will be used for display purposes, but other internal names will be used throughout the code due to processes not meant to be seen by the viewers of the html file. -->

<!-- Use Latin-1 ("latin1"), also called ISO-8859-1, an 8-bit character set endorsed by the International Organization for Standardization (ISO) to represent the alphabets of Western European languages. ISO Latin 1 was designed with the following languages in mind: Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish.-->

<!-- Use UTF-8 ("UTF-8"), Unicode Transformation Format-8, to support all languages and alphabets, including Asian languages and their character depth. It is a widely supported and flexible character encoding. UTF-8 is capable of encoding all 1,112,064 valid Unicode code points using one to four one-byte (8-bit) code units. -->

<!-- Unicode is an international character encoding standard that provides a unique number for every character across languages and scripts, making almost all characters accessible across platforms, programs, and devices. -->
```{r}
training <- read.csv(file = "Corona_NLP_train.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE, strip.white = TRUE, encoding = "latin1")

#encoding = "UTF-8"
testing <- read.csv(file = "Corona_NLP_test.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE, strip.white = TRUE, encoding = "latin1")
```
<!-- First substitution of the data frames' names. -->
```{r echo = FALSE}
df.train.raw <- training
df.test.raw <- testing
```
<br>
<font size="+2"><span style="color: navy;"><strong>Data Exploration</strong></span></font>
<br>
<br>
<font size="+1" color="magenta"><strong>Training data set</strong></font>
<br>
<font size="+0" color="black"><strong>Displaying the structure of the <u>Training</u> data set:</strong></font>
```{r echo = FALSE}
str(df.train.raw)
```
<br>
<font size="+0" color="black"><strong>Identifying missing values (NAs) in the <u>Training</u> data set:</strong></font>
```{r echo = FALSE}
colSums(is.na(df.train.raw))
```

<div style="color:black;background-color:lightyellow">
   <br>
   <font size="+0.3"><span style="color: darkblue">Based on the NA counts per column, there are <strong>no</strong> missing values (NAs) in the <strong>Training</strong> data frame.</span></font>
   <br>
   <br>
</div>
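
The check above only counts values that R stores as `NA`. Blank fields in text columns, such as the `""` entries visible in `Location`, are read in as empty strings and are therefore not counted. The sketch below shows one way both kinds of gaps could be counted. It uses a small made-up data frame rather than the real data, and the object names are illustrative.

```r
# Toy data frame standing in for the real one (values are made up).
toy <- data.frame(
  Location  = c("London", "", "UK", NA),
  Sentiment = c("Neutral", "Positive", "", "Negative"),
  stringsAsFactors = FALSE
)

# Number of true NA values per column.
na_counts <- colSums(is.na(toy))

# Number of empty (or whitespace-only) strings per column; NA entries are not counted here.
empty_counts <- colSums(sapply(toy, function(col) !is.na(col) & trimws(col) == ""))

na_counts
empty_counts
```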

<br>
<font size="+0" color="black"><strong>Displaying the content of the <u>first two observations</u> of the <em>OriginalTweet</em> variable within the <u>Training</u> data set:</strong></font>
```{r echo = FALSE}
df.train.raw$OriginalTweet[[1]]
```
```{r echo = FALSE}
df.train.raw$OriginalTweet[[2]]
```

<br>
<font size="+0" color="black"><strong>Displaying the frequency and the distribution of the <em>Sentiment</em> variable within the <u>Training</u> data set:</strong></font>
```{r echo = FALSE}
# The frequency table of the Sentiment feature in the df.train.raw data frame.
tbl.raw <- table(df.train.raw$Sentiment)
tbl.raw
```
<!-- 
Although Histograms are most commonly used for visualizing the distribution, they are better used for continuous numerical data. For categorical data, bar plots are more appropriate.
-->
<!-- https://r-graph-gallery.com/209-the-options-of-barplot.html -->
```{r echo = FALSE}
# Category names (labels):
nms <- c("Extm. N.", "Extm. P.", "Neg.", "Neut.", "Pos.")
# Category colors:
clrs <- c("red", "blue", "pink", "white", "lightblue2")

# Creation of the bar plot with colors and customized bar width:
barplot(tbl.raw, names.arg = nms, main = "Sentiment's distribution in the Training data set\n(frequency)", col = clrs, width = c(0.75, 0.85, 1.5, 1.2, 1.7), xlab = "Categories", ylab = "Frequency", font.lab = 3, col.lab = "blue")
```

```{r echo = FALSE}
# Category names (labels):
names <- c("Extm. Neg.", "Extm. Pos.", "Neg.", "Neut.", "Pos.")
# Category percentages:
percentage <- round(tbl.raw / length(df.train.raw$Sentiment) * 100, 2)
# Adding the first parenthesis to the labels:
lbls <- paste("(", names, sep = "")
# Adding the second parenthesis and the percentages to labels:
lbls <- paste(lbls, percentage, sep = ") ")
# Adding the % symbols to labels
lbls <- paste(lbls,"%",sep = "")

# Creating the pie chart with the labels and percentages, and with the same colors as in the barplot.
pie(tbl.raw, main = "Sentiment's distribution in the Training data set\n(percentage)", labels = lbls, col = c("red", "blue", "pink", "white", "lightblue2"))
```
<font size="+1" color="magenta"><strong>Testing data set</strong></font>
<br>
<font size="+0" color="black"><strong>Displaying the structure of the <u>Testing</u> data set:</strong></font>
```{r echo = FALSE}
str(df.test.raw)
```
<br>
<font size="+0" color="black"><strong>Identifying missing values (NAs) in the <u>Testing</u> data set:</strong></font>
```{r echo = FALSE}
colSums(is.na(df.test.raw))
```
<div style="color:black;background-color:lightyellow">
   <br>
   <font size="+0.3"><span style="color: darkblue">Based on the NA counts per column, there are also <strong>no</strong> missing values (NAs) in the <strong>Testing</strong> data frame.</span></font>
   <br>
   <br>
</div>

<br>
<font size="+0" color="black"><strong>Displaying the content of the <u>first two observations</u> of the <em>OriginalTweet</em> variable within the <u>Testing</u> data set:</strong></font>
```{r echo = FALSE}
df.test.raw$OriginalTweet[[1]]
```
```{r echo = FALSE}
df.test.raw$OriginalTweet[[2]]
```

<br>
<font size="+0" color="black"><strong>Displaying the frequency and the distribution of the <em>Sentiment</em> variable within the <u>Testing</u> data set:</strong></font>
```{r echo = FALSE}
tbl2.raw <- table(df.test.raw$Sentiment)
tbl2.raw 
```

```{r echo = FALSE}
# Category names (labels):
nms2 <- c("E.N.", "E.P.", "N.", "Ne.", "P.")
# Category colors:
clrs2 <- c("brown", "darkgreen", "coral", "white", "lightgreen")

# Creation of the bar plot with colors and customized bar width:
barplot(tbl2.raw, names.arg = nms2, main = "Sentiment's distribution in the Testing data set\n(frequency)", col = clrs2, width = c(1.1, 1.2, 1.7, 1, 1.5), xlab = "Frequency", ylab = "Categories", font.lab = 3, col.lab = "blue", horiz = TRUE)

# For lines and angles:
# density = c(100,100,100,0,100), angle = c(0,0,90,11,0)
```
<!-- https://search.r-project.org/CRAN/refmans/plotrix/html/pie3D.html -->
```{r echo = FALSE}
# For the 3D Exploded Pie Chart:
library(plotrix)
```

```{r echo = FALSE}
# Category names (labels):
names <- c("Extm.Neg.", "Extm.Pos.", "Neg.", "Neut.", "Pos.")
# Category percentages:
percentage <- round(tbl2.raw / length(df.test.raw$Sentiment) * 100, 2)
# Adding the first parenthesis to the labels:
lbls <- paste("(", names, sep = "")
# Adding the second parenthesis and the percentages to labels:
lbls <- paste(lbls, percentage, sep = ") ")
# Adding the % symbols to labels
lbls <- paste(lbls,"%",sep = "")

# Creating the pie chart with the labels and percentages, and with the same colors as in the barplot.
pie3D(tbl2.raw, labels = lbls, explode = 0.05, main = "Sentiment's distribution in the Testing data set\n(percentage)", sub = "Pie Chart of distribution", col = clrs2, theta = 1.2, height = 0.005, start = 86)
```

<br>
<font size="+2"><span style="color: navy;"><strong>Data Preprocessing</strong></span></font>
<br>
<br>

```{r echo = FALSE}
library(stringr)
```
<font size="+1" color="magenta"><strong>Training data set</strong></font>
<br>
<font size="+0" color="black"><strong>Replacing the entries labeled as <u>Extremely Positive</u> with <u>Positive</u> and the entries labeled as <u>Extremely Negative</u> with <u>Negative</u> of the <em>Sentiment</em> variable in the <u>Training</u> data set:</strong></font>
```{r}
training$Sentiment[training$Sentiment == "Extremely Positive"] <- "Positive"
training$Sentiment[training$Sentiment == "Extremely Negative"] <- "Negative"
```
<br>
<font size="+0" color="black"><strong>Displaying the frequency and the distribution of the <em>Sentiment</em> variable within the <u>Training</u> data set:</strong></font>
```{r echo = FALSE}
table(training$Sentiment)
```

```{r echo = FALSE}
# Category names (labels):
nms <- c("Neg.", "Neut.", "Pos.")
# Category colors:
clrs <- c("pink", "white", "lightblue2")

# Creation of the bar plot with colors and customized bar width:
barplot(table(training$Sentiment), names.arg = nms, main = "Sentiment's distribution in the Training data set\n(frequency)", col = clrs, width = c(1.5, 1, 1.7), xlab = "Categories", ylab = "Frequency", font.lab = 3, col.lab = "blue")
```

```{r echo = FALSE}
# Category names (labels):
nms <- c("Neg.", "Neut.", "Pos.")
# Category percentages:
percentage <- round(table(training$Sentiment) / length(training$Sentiment) * 100, 2)
# Adding the first parenthesis to the labels:
lbls <- paste("(", nms, sep = "")
# Adding the second parenthesis and the percentages to labels:
lbls <- paste(lbls, percentage, sep = ") ")
# Adding the % symbols to labels
lbls <- paste(lbls,"%",sep = "")

# Creating the pie chart with the labels and percentages, and with the same colors as in the barplot.
pie(table(training$Sentiment), main = "Sentiment's distribution in the Training data set\n(percentage)", labels = lbls, col = c("pink", "white", "lightblue2"))
```
<font size="+0" color="black"><strong>Factorizing the <em>Sentiment</em> variable, assigning numeric values to the resulting factors, and adding them to the <u>Training</u> data set as a new column named <em>Labels</em>:</strong></font>
```{r echo = TRUE}
training["Labels"] <- as.integer(as.factor(training$Sentiment)) - 1
```

<br>
<font size="+0" color="black"><strong>Displaying the frequency of the <em>Labels</em> variable within the <u>Training</u> data set:</strong></font>
```{r echo = FALSE}
table(training$Labels)
```

<div style="color:black;background-color:lightyellow">
   <br>
   <font size="+0.3"><span style="color: darkblue">The frequency counts of the <strong>Labels</strong> variable exactly match those of the <strong>Sentiment</strong> column in the <u>Training</u> data set.</span></font>
   <br>
   <br>
</div>
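
The counts line up because `as.factor()` orders its levels alphabetically, so after the relabelling step the codes should come out as Negative = 0, Neutral = 1 and Positive = 2. The sketch below shows the mapping on a few made-up values; `label_key` is an illustrative name and not part of the notebook.

```r
# Made-up sentiment values, in no particular order.
sentiment <- c("Neutral", "Positive", "Negative", "Positive")

# Factor levels are sorted alphabetically: "Negative", "Neutral", "Positive".
f <- as.factor(sentiment)

# Same conversion as in the notebook: integer codes shifted to start at 0.
labels <- as.integer(f) - 1

# Lookup table from numeric label back to sentiment name.
label_key <- setNames(levels(f), seq_along(levels(f)) - 1)

labels
label_key
```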

```{r echo = FALSE}
library(textclean)
library(SnowballC)
library(hunspell)
library(NLP)
library(tm)
```
<!--
You will need the packages "textclean", "tm" (plus "NLP") and "SnowballC" (plus "hunspell" for "check_text()").

For text mining, before "VectorSource", 
do these in the order they are:

1. replace_non_ascii()		 - (from textclean) Replaces non-ASCII with equivalent or remove
2. tolower() - (R's built-in function) Converts text to lowercase.
3. replace_date()		 - (from textclean) Replaces dates
4. replace_time()		 - (from textclean) Replaces time stamps
5. replace_money()		 - (from textclean) Replaces money in the form of $\d+.?\d{0,2}
6. replace_number()	- (from textclean) Replaces common numbers
7. replace_contraction(words, , ignore.case = TRUE)	- (from textclean) Replace contractions with both words
8. replace_grade()		 - (from textclean) Replaces grades (e.g., “A+”) with word equivalent
9. replace_email()		 - (from textclean) Replaces emails
10. replace_hash()		 - (from textclean) Replaces Twitter style hash tags (e.g., #rstats). Hashtag words won't provide any useful meaning to text in sentiment analysis.
11. replace_tag()		 - (from textclean) Replaces Twitter style handle tag (e.g., @trinker)
12. replace_html()		 - (from textclean) Replaces HTML tags and symbols
13. replace_url()		 - (from textclean) Replaces URLs
14. replace_internet_slang(words, ignore.case = TRUE)	 - 	(from textclean) Replace Internet slang with word equivalents
15. replace_emoji()		 - (from textclean) Replaces emojis with word equivalent or unique identifier
16. replace_emoticon()		 - (from textclean) Replaces emoticons with word equivalent

17. removeWords(words, stopwords("english")) - (from tm) Removes the stopwords.



18. replace_symbol()		 - (from textclean) Replaces common symbols
19. mgsub(words, "symbol-to-be-replaced", "replacement-symbol")	 - from (textclean) Multiple gsub. For some reason R is not removing the single quotes ' automatically, so the reason of using this after replace_symbol().
20. strip()	 - (from textclean) Remove all non word characters
21. replace_kern()		 - (from textclean) Replace spaces for >2 letter, all cap, words containing spaces in between letters
22. wordStem(words, language = "english") - (from SnowballC) Extracts the stems of each of the given words in the vector. Type getStemLanguages() to get a list of languages supported by wordStem().
23. replace_white()		 - (from textclean) Replace regex white space characters

24. Remove rows with empty values in the text column (see Note 2 below).

Note 1: Use the check_text() to see if after cleaning the text there are some potential issues (not all of them are necessary to correct), but do it privately (do not display it in the html and comment it when not using it.)

    25. check_text()	 - (from textclean) Text report of potential issues


Note 2: If you have a message about an empty row, CHECK FIRST that the row is actually empty, not just having some of the cells empty. Do not use the drop_empty_row() from textclean, instead delete the row as you normally would do.

    training <- training[!(training$OriginalTweet== ""), ]


https://github.com/trinker/textclean
-->


<br>
<font size="+0" color="black"><strong>Cleaning and transforming the <u>text</u> of the <em>OriginalTweet</em> variable in the <u>Training</u> data set:</strong></font>
```{r warning = FALSE}
training$OriginalTweet <- replace_non_ascii(training$OriginalTweet)
training$OriginalTweet <- tolower(training$OriginalTweet)
training$OriginalTweet <- replace_date(training$OriginalTweet)
training$OriginalTweet <- replace_time(training$OriginalTweet)
training$OriginalTweet <- replace_money(training$OriginalTweet)
training$OriginalTweet <- replace_number(training$OriginalTweet)
training$OriginalTweet <- replace_contraction(training$OriginalTweet, ignore.case = TRUE)

training$OriginalTweet <- replace_grade(training$OriginalTweet)
training$OriginalTweet <- replace_email(training$OriginalTweet)
training$OriginalTweet <- replace_hash(training$OriginalTweet)
training$OriginalTweet <- replace_tag(training$OriginalTweet)
training$OriginalTweet <- replace_html(training$OriginalTweet)
training$OriginalTweet <- replace_url(training$OriginalTweet)
training$OriginalTweet <- replace_internet_slang(training$OriginalTweet, ignore.case = TRUE)
training$OriginalTweet <- replace_emoji(training$OriginalTweet)
training$OriginalTweet <- replace_emoticon(training$OriginalTweet)

training$OriginalTweet <- removeWords(training$OriginalTweet, stopwords("english"))

training$OriginalTweet <- replace_symbol(training$OriginalTweet)

training$OriginalTweet <- mgsub(training$OriginalTweet, "'", "")

training$OriginalTweet <- strip(training$OriginalTweet)
training$OriginalTweet <- replace_kern(training$OriginalTweet)
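# Note: wordStem() stems each vector element as a single word, so a whole tweet
# is treated as one token here; tm::stemDocument() stems word by word instead.
training$OriginalTweet <- wordStem(training$OriginalTweet, language = "english")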
training$OriginalTweet <- replace_white(training$OriginalTweet)

training <- training[!(training$OriginalTweet== ""), ]
```
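
To make the effect of such a cleaning chain easier to see, here is a simplified stand-in written in base R only. It is a sketch under stated assumptions: `clean.tweet()` is a hypothetical helper, the regular expressions are illustrative, and it does not reproduce what the `textclean` functions actually do.

```r
# Simplified stand-in for a few of the cleaning steps, using base R only.
# clean.tweet() is a hypothetical helper; it does not replicate textclean.
clean.tweet <- function(x) {
  x <- tolower(x)                     # lowercase
  x <- gsub("https?://\\S+", " ", x)  # drop URLs
  x <- gsub("[@#]\\w+", " ", x)       # drop handles and hashtags
  x <- gsub("[^a-z ]", " ", x)        # keep letters and spaces only
  x <- gsub("\\s+", " ", x)           # collapse repeated white space
  trimws(x)
}

# Made-up tweets; the second one contains nothing but a handle and a URL.
tweets <- c("Stock up NOW!! https://t.co/abc #covid19 @someone",
            "@user https://t.co/xyz")
cleaned <- clean.tweet(tweets)
cleaned

# Tweets that end up empty after cleaning carry no words and are dropped.
cleaned[cleaned != ""]
```

The second toy tweet shows why the empty-row filter comes last: a tweet made only of handles and links has no text left once those are removed.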

<!-- ONLY perform this to check if there are things that need to be fixed or improved and
don't display this in the html. -->
```{r echo = FALSE, warning = FALSE}
#check_text(training$OriginalTweet)
```

<!-- A "volatile corpora" in the context of R refers to a collection of text data stored as R objects, that is completely stored in the computer's memory (RAM). This means that all the text information is readily available for analysis, and any changes or operations are done directly in the computer's memory without necessarily saving them to a more permanent storage location like a file. -->

<!-- "Corpora" are R objects held fully in memory. -->

<!-- "Corpus" typically refers to a collection of text documents. -->

<!-- "Volatile" implies that the corpus is fully kept in memory, meaning that all the data and changes to the corpus are stored in the computer's RAM (Random Access Memory). -->


<br>
<font size="+0" color="black"><strong>Setting up a <u>Vector Source</u> to treat each element of the <em>OriginalTweet</em> variable in the <u>Training</u> data set as an individual document:</strong></font>
```{r}
training.vs <- VectorSource(training$OriginalTweet)
```
<br>
<font size="+0" color="black"><strong>Displaying the structure of the <u>Training</u> vector source:</strong></font>
```{r echo = FALSE}
str(training.vs)
```
<br>
<font size="+0" color="black"><strong>Creating a <u>Volatile Corpus</u> from the <u>Training</u> vector source:</strong></font>
```{r}
training.vc <- VCorpus(training.vs)
```

```{r echo = FALSE}
# Deleting the vector source, which is no longer needed:
rm(training.vs)
```
<br>
<font size="+0" color="black"><strong>Displaying details about the <u>Volatile Corpus</u>:</strong></font>
```{r echo = FALSE}
print(training.vc)
```
<br>
<font size="+0" color="black"><strong>Displaying details of the <u>first two documents</u> within the <u>Training</u> Volatile Corpus:</strong></font>
```{r echo = FALSE}
content(training.vc[1:2])
```
<br>
<font size="+0" color="black"><strong>Displaying the content of the <u>first two documents</u> within the <u>Training</u> Volatile Corpus:</strong></font>
```{r echo = FALSE}
content(training.vc[[1]])
```

```{r echo = FALSE}
content(training.vc[[2]])
```

```{r echo = FALSE}
# Creating a temporary data frame of the Volatile Corpus, in order to create the Wordclouds:
temp.df <- data.frame(Text = sapply(training.vc, as.character), stringsAsFactors = FALSE)
```

```{r echo = FALSE}
# Adding the Sentiment column (labels) to the temporary data frame:
temp.df["Sentiment"] <- training$Sentiment
```

```{r echo = FALSE}
library(RColorBrewer)
library(wordcloud)
```

```{r echo = FALSE}
# To be used to display wordclouds of only positive, negative, or neutral sentiments.
# Not all will be used in this analysis.
positive <- temp.df[temp.df$Sentiment == "Positive", ]
negative <- temp.df[temp.df$Sentiment == "Negative", ]
neutral <- temp.df[temp.df$Sentiment == "Neutral", ]
```

<br>
<font size="+0" color="black"><strong>Displaying a <u>Wordcloud</u> of the <u>top 100 most frequent words</u>:</strong></font>
```{r echo = FALSE, warning = FALSE}
# max.words limits the word cloud to display the top 100 most frequent words from the text data.
# In scale, the first value (3) determines the maximum size of the words, while the second value (0.5) determines the minimum size. Larger words will be scaled up to three (3) times their normal size, and smaller words will be scaled down to half (0.5) their normal size.
wordcloud(temp.df$Text, max.words = 100, scale = c(3, 0.5))
```

```{r echo = FALSE}
library(stringr)
```
<br>
<font size="+0" color="black"><strong>Displaying a list of the <u>top 10 most frequent words</u>:</strong></font>
```{r echo = FALSE}
# Extracts the words from temp.df$Text:
words <- unlist(str_extract_all(temp.df$Text, "\\b\\w+\\b"))
# Creates a frequency table of "words":
freq.of.words <- table(words)
# Sorts the "freq.of.words" table from most to least frequent:
freq.sorted <- sort(freq.of.words, decreasing = TRUE)
# Retrieves the top 10 words from the "freq.sorted" table and displays them as a data frame:
top.ten <- head(freq.sorted, 10)
as.data.frame(top.ten)
```

<br>
<font size="+0" color="black"><strong>Displaying a <u>Wordcloud</u> of the <u>top 100 most frequent words</u> of the <u>positive sentiments</u>:</strong></font>
```{r echo = FALSE, warning = FALSE}
# max.words limits the word cloud to display the top 100 most frequent words from the text data.
# In scale, the first value (3) determines the maximum size of the words, while the second value (0.5) determines the minimum size. Larger words will be scaled up to three (3) times their normal size, and smaller words will be scaled down to half (0.5) their normal size.
wordcloud(positive$Text, max.words = 100, scale = c(3, 0.5))
```
<br>
<font size="+0" color="black"><strong>Displaying a list of the <u>top 10 most frequent words</u> of the <u>positive sentiments</u>:</strong></font>
```{r echo = FALSE}
# Extracts the words from positive$Text:
words <- unlist(str_extract_all(positive$Text, "\\b\\w+\\b"))
# Creates a frequency table of "words":
freq.of.words <- table(words)
# Sorts the "freq.of.words" table from most to least frequent:
freq.sorted <- sort(freq.of.words, decreasing = TRUE)
# Retrieves the top 10 words from the "freq.sorted" table and displays them as a data frame:
top.ten <- head(freq.sorted, 10)
as.data.frame(top.ten)
```

<!-- A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a each document in a collection. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.  -->
<br>
<font size="+0" color="black"><strong>Generating a <u>Document-Term Matrix</u> from the <u>Training</u> Volatile Corpus to depict the frequency of terms within each document in the collection:</strong></font>
```{r}
dtm <- DocumentTermMatrix(training.vc)
```
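
As an illustration of what a document-term matrix holds, the sketch below builds one by hand in base R for three made-up documents. It is not how `tm` builds its sparse matrix internally, and the names `docs`, `tokens`, `vocab`, and `toy.dtm` are hypothetical.

```r
# Toy cleaned documents (hypothetical).
docs <- c("food store food", "store price", "panic")

# Tokenize each document on white space.
tokens <- strsplit(docs, "\\s+")

# The vocabulary is the sorted set of distinct words.
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term, each cell a count.
toy.dtm <- t(vapply(tokens,
                    function(tk) as.integer(table(factor(tk, levels = vocab))),
                    integer(length(vocab))))
dimnames(toy.dtm) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)
toy.dtm
```

Each row sums to the number of words in its document, and most cells are zero, which is why `tm` stores the real matrix in a sparse format.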

<br>
<font size="+0" color="black"><strong>Inspecting the contents of the Document-Term Matrix and displaying it in a human-readable format. This displays terms and their frequencies within each document in the collection:</strong></font>
```{r}
tm::inspect(dtm)
```

<br>
<font size="+0" color="black"><strong>Showing the number of terms that are repeated <u>100 times or more</u> in the Document-Term Matrix from the <u>Training</u> data set:</strong></font>
```{r}
# Gets a list of those terms repeated 100 times or more from the Document-Term Matrix:
high.freq.terms <- findFreqTerms(dtm, lowfreq = 100)
# Display the quantity of those terms repeated 100 times or more:
length(high.freq.terms)
```

<br>
<font size="+0" color="black"><strong>Displaying a list of terms repeated <u>100 times or more</u> in the Document-Term Matrix from the <u>Training</u> data set:</strong></font>
```{r echo = FALSE}
# Gets the total frequency of each term repeated 100 times or more in the
# Document-Term Matrix and stores the totals as a named numeric vector:
high.term.frequencies <- colSums(as.matrix(dtm[, high.freq.terms]))
# Sorts the "high.term.frequencies" vector from most to least frequent:
sorted.high.term.frequencies <- sort(high.term.frequencies, decreasing = TRUE)
# Displays the "sorted.high.term.frequencies" matrix as a data frame (in order to improve
# readability by showing the results as a table with multiple tabs):
sorted.high.term.frequencies.df <- as.data.frame(sorted.high.term.frequencies)
sorted.high.term.frequencies.df
```

```{r echo = FALSE}
# Deleting variables that will not be used later:
rm(high.term.frequencies)
rm(sorted.high.term.frequencies)
rm(sorted.high.term.frequencies.df)
```

<br>
<font size="+0" color="black"><strong>Showing the number of terms that are repeated <u>less than 100 times</u> in the Document-Term Matrix from the <u>Training</u> data set:</strong></font>
```{r echo = FALSE}
# Gets a list of those terms repeated less than 100 times from the Document-Term Matrix:
low.freq.terms <- findFreqTerms(dtm, highfreq = 99)
# Display the quantity of those terms repeated less than 100 times:
length(low.freq.terms)
```

<br>
<font size="+0" color="black"><strong>Displaying a list of terms repeated <u>fewer than 100 times</u> (limited to those repeated between 90 and 99 times) in the Document-Term Matrix from the <u>Training</u> data set:</strong></font>
```{r echo = FALSE}
# Gets the total frequency of the terms repeated between 90 and 99 times in the
# Document-Term Matrix and stores the totals as a named numeric vector:
low.term.frequencies <- colSums(as.matrix(dtm[, findFreqTerms(dtm, highfreq = 99, lowfreq = 90) ]))
# Sorts the "low.term.frequencies" vector from most to least frequent:
sorted.low.term.frequencies <- sort(low.term.frequencies, decreasing = TRUE)
# Displays the "sorted.low.term.frequencies" matrix as a data frame (in order to improve
# readability by showing the results as a table with multiple tabs):
sorted.low.term.frequencies.df <- as.data.frame(sorted.low.term.frequencies)
sorted.low.term.frequencies.df
```

```{r echo = FALSE}
# Deleting variables that will not be used later:
rm(low.freq.terms)
rm(low.term.frequencies)
rm(sorted.low.term.frequencies)
rm(sorted.low.term.frequencies.df)
```

<br>
<font size="+0" color="black"><strong>Removing those terms repeated less than 100 times from the Document-Term Matrix from the <u>Training</u> data set:</strong></font>
```{r}
dtm <- dtm[, colnames(dtm) %in% high.freq.terms]
```
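
The filtering step boils down to summing each term's counts over all documents and keeping the columns whose total reaches the threshold. A minimal base R sketch on a made-up count matrix follows (the matrix, `min.count`, and the other names are hypothetical; the threshold is 2 instead of 100 so the toy data can show the effect):

```r
# Toy document-term count matrix (hypothetical values).
toy.dtm <- matrix(c(2, 0, 1,
                    1, 0, 0,
                    3, 1, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Docs = c("1", "2", "3"),
                                  Terms = c("food", "panic", "store")))

# Total occurrences of each term across all documents.
term.totals <- colSums(toy.dtm)

# Keep the terms whose total count reaches the threshold (inclusive).
min.count <- 2
frequent.terms <- names(term.totals)[term.totals >= min.count]

# Subset the columns; drop = FALSE keeps a matrix even if one term remains.
toy.dtm.filtered <- toy.dtm[, frequent.terms, drop = FALSE]
toy.dtm.filtered
```

Note that the number of documents is unchanged by this step; only columns are removed, so some documents may be left with all-zero rows.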

```{r echo = FALSE}
# Deleting variables that will not be used later:
rm(high.freq.terms)
```

<br>
<font size="+0" color="black"><strong>Confirming there are no terms repeated <u>less than 100 times</u> in the Document-Term Matrix from the <u>Training</u> data set:</strong></font>
```{r}
cat("Terms repeated less than 100 times: ", length(findFreqTerms(dtm, highfreq = 99)))
```


<br>
<br>
<font size="+1" color="magenta"><strong>Testing data set</strong></font>
<br>
<font size="+0" color="black"><strong>Replacing the entries labeled as <u>Extremely Positive</u> with <u>Positive</u> and the entries labeled as <u>Extremely Negative</u> with <u>Negative</u> of the <em>Sentiment</em> variable in the <u>Testing</u> data set:</strong></font>
```{r}
testing$Sentiment[testing$Sentiment == "Extremely Positive"] <- "Positive"
testing$Sentiment[testing$Sentiment == "Extremely Negative"] <- "Negative"
```

<br>
<font size="+0" color="black"><strong>Displaying the frequency and the distribution of the <em>Sentiment</em> variable within the <u>Testing</u> data set:</strong></font>
```{r echo = FALSE}
table(testing$Sentiment)
```

```{r echo = FALSE}
# Category names (labels):
nms2 <- c("Neg.", "Neut.", "Pos.")
# Category colors:
clrs2 <- c("coral", "white", "lightgreen")

# Creation of the bar plot with colors and customized bar width:
barplot(table(testing$Sentiment), names.arg = nms2, main = "Sentiment's distribution in the Testing data set\n(frequency)", col = clrs2, width = c(1.7, 0.8, 1.5), xlab = "Frequency", ylab = "Categories", font.lab = 3, col.lab = "blue", horiz = TRUE)

# For lines and angles:
# density = c(100,100,100,0,100), angle = c(0,0,90,11,0)
```

```{r echo = FALSE}
library(plotrix)
```

```{r echo = FALSE}
# Category names (labels); "nms2" is used instead of "names" to avoid masking base::names():
nms2 <- c("Neg.", "Neut.", "Pos.")
# Category percentages:
percentage <- round(table(testing$Sentiment) / length(testing$Sentiment) * 100, 2)
# Adding the first parenthesis to the labels:
lbls <- paste("(", nms2, sep = "")
# Adding the second parenthesis and the percentages to labels:
lbls <- paste(lbls, percentage, sep = ") ")
# Adding the % symbols to labels
lbls <- paste(lbls,"%",sep = "")

# Creating the pie chart with the labels and percentages, and with the same colors as in the barplot.
pie3D(table(testing$Sentiment), labels = lbls, explode = 0.05, main = "Sentiment's distribution in the Testing data set\n(percentage)", sub = "Pie Chart of distribution", col = clrs2, theta = 1.2, height = 0.005, start = 86)
```

<br>
<font size="+0" color="black"><strong>Factorizing the <em>Sentiment</em> variable, assigning numeric values to the resulting factors, and adding them to the <u>Testing</u> data set as a new column named <em>Labels</em>:</strong></font>
```{r}
testing["Labels"] <- as.integer(as.factor(testing$Sentiment)) - 1
```

<br>
<font size="+0" color="black"><strong>Displaying the frequency of the <em>Labels</em> variable within the <u>Testing</u> data set:</strong></font>
```{r echo = FALSE}
table(testing$Labels)
```

<div style="color:black;background-color:lightyellow">
   <br>
   <font size="+0.3"><span style="color: darkblue">The frequency counts of the <strong>Labels</strong> variable exactly match those of the <strong>Sentiment</strong> column in the <u>Testing</u> data set.</span></font>
   <br>
   <br>
</div>

<br>
<font size="+0" color="black"><strong>Cleaning and transforming the <u>text</u> of the <em>OriginalTweet</em> variable in the <u>Testing</u> data set:</strong></font>
```{r warning = FALSE}
testing$OriginalTweet <- replace_non_ascii(testing$OriginalTweet)
testing$OriginalTweet <- tolower(testing$OriginalTweet)
testing$OriginalTweet <- replace_date(testing$OriginalTweet)
testing$OriginalTweet <- replace_time(testing$OriginalTweet)
testing$OriginalTweet <- replace_money(testing$OriginalTweet)
testing$OriginalTweet <- replace_number(testing$OriginalTweet)
testing$OriginalTweet <- replace_contraction(testing$OriginalTweet, ignore.case = TRUE)

testing$OriginalTweet <- replace_grade(testing$OriginalTweet)
testing$OriginalTweet <- replace_email(testing$OriginalTweet)
testing$OriginalTweet <- replace_hash(testing$OriginalTweet)
testing$OriginalTweet <- replace_tag(testing$OriginalTweet)
testing$OriginalTweet <- replace_html(testing$OriginalTweet)
testing$OriginalTweet <- replace_url(testing$OriginalTweet)
testing$OriginalTweet <- replace_internet_slang(testing$OriginalTweet, ignore.case = TRUE)
testing$OriginalTweet <- replace_emoji(testing$OriginalTweet)
testing$OriginalTweet <- replace_emoticon(testing$OriginalTweet)

testing$OriginalTweet <- removeWords(testing$OriginalTweet, stopwords("english"))

testing$OriginalTweet <- replace_symbol(testing$OriginalTweet)

testing$OriginalTweet <- mgsub(testing$OriginalTweet, "'", "")

testing$OriginalTweet <- strip(testing$OriginalTweet)
testing$OriginalTweet <- replace_kern(testing$OriginalTweet)
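# Note: wordStem() stems each vector element as a single word, so a whole tweet
# is treated as one token here; tm::stemDocument() stems word by word instead.
testing$OriginalTweet <- wordStem(testing$OriginalTweet, language = "english")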
testing$OriginalTweet <- replace_white(testing$OriginalTweet)

testing <- testing[!(testing$OriginalTweet== ""), ]
```

<!-- ONLY perform this to check if there are things that need to be fixed or improved and
don't display this in the html. -->
```{r echo = FALSE, warning = FALSE}
#check_text(testing$OriginalTweet)
```

<br>
<font size="+0" color="black"><strong>Setting up a <u>Vector Source</u> to treat each element of the <em>OriginalTweet</em> variable in the <u>Testing</u> data set as individual documents:</strong></font>
```{r}
testing.vs <- VectorSource(testing$OriginalTweet)
```

<br>
<font size="+0" color="black"><strong>Creating a <u>Volatile Corpus</u> from the <u>Testing</u> vector source:</strong></font>
```{r}
testing.vc <- VCorpus(testing.vs)
```

```{r echo = FALSE}
# Deleting the vector source, which is no longer needed:
rm(testing.vs)
```

<br>
<font size="+0" color="black"><strong>Generating a <u>Document-Term Matrix</u> utilizing the <u>Training</u> Document-Term Matrix and the <u>Testing</u> Volatile Corpus. This process ensures that the terms found in the <u>Training</u> Document-Term Matrix are mirrored in the <u>Testing</u> Document-Term Matrix:</strong></font>
<!-- Creates a Document-Term Matrix from the new corpus, using the vocabulary from the original DTM. -->
```{r}
test_dtm <- DocumentTermMatrix(testing.vc, control = list(dictionary = Terms(dtm)))
```
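
The role of the dictionary can be sketched in base R: test documents are counted only against the training vocabulary, so words never seen in training are dropped and training terms absent from a test document get a zero. This is an illustrative sketch, not `tm`'s implementation; `train.vocab`, `test.docs`, and `count.terms()` are hypothetical names.

```r
# Vocabulary kept from the training matrix (hypothetical terms).
train.vocab <- c("food", "panic", "store")

# Toy test documents; "toilet" is not in the training vocabulary.
test.docs <- c("food toilet food", "store")

# Count only dictionary terms: factor() with fixed levels drops unseen words
# and yields a zero for dictionary terms that do not occur.
count.terms <- function(doc, vocab) {
  tokens <- strsplit(doc, "\\s+")[[1]]
  as.integer(table(factor(tokens, levels = vocab)))
}

toy.test.dtm <- t(vapply(test.docs, count.terms, integer(length(train.vocab)),
                         vocab = train.vocab, USE.NAMES = FALSE))
colnames(toy.test.dtm) <- train.vocab
toy.test.dtm
```

Keeping the columns identical in both matrices matters because the model looks up each term's probability by column.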

<br>
<font size="+0" color="black"><strong>Showing the count of documents and terms in both Document-Term Matrices (number of terms should be the same):</strong></font>
```{r echo = FALSE}
# Get the terms (column names) from each DTM
training.dtm.terms <- colnames(dtm)
testing.dtm.terms <- colnames(test_dtm)

# Check if both DTMs have the same number of terms (columns)
terms_count_match <- ncol(dtm) == ncol(test_dtm)

# cat() has no "end" argument, so the line break is passed as a regular string.
cat("DTM", "\t\t", "Documents", "\t", "Terms", "\n")
cat("[Training]", "\t", nrow(dtm), "\t\t", ncol(dtm), "\n")
cat("[Testing]", "\t", nrow(test_dtm), "\t\t", ncol(test_dtm), "\n")

if (!terms_count_match) {
  cat("\n", ">>> WARNING: The number of terms is not the same!")
} else {
  cat("\n", ">>> The number of terms is the same.")
}
```

<br>
<font size="+0" color="black"><strong>Presenting the lists of terms contained within both Document-Term Matrices (the terms should be the same):</strong></font>
```{r echo = FALSE}
# Compare the terms to ensure they are identical; identical() also handles
# vectors of different lengths, where "==" would recycle values.
terms_match <- identical(sort(training.dtm.terms), sort(testing.dtm.terms))

print(as.data.frame(sort(training.dtm.terms)))
print(as.data.frame(sort(testing.dtm.terms)))

if (!terms_match) {
  cat(">>> WARNING: The terms are not identical!")
} else {
  cat(">>> The terms are identical.")
}
```














<br>
<br>
<font size="+2" color="navy"><strong>Model Training and Evaluation:</strong></font>
<br>
<!-- 

Naive Bayes Classifiers:

Packages: "e1071" and "naivebayes"

From: https://medium.com/@dancerworld60/demystifying-naïve-bayes-log-probability-and-laplace-smoothing-d6da61b0e70b

Also see: https://cran.r-project.org/web/packages/naivebayes/vignettes/intro_naivebayes.pdf


General Naive Bayes function:

    ["e1071"]
    naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)

    ["naivebayes"]
    naive_bayes(x, y, prior = NULL, laplace = 0, usekernel = FALSE, usepoisson = FALSE, ...)


1. Gaussian Naive Bayes -  This type of Naive Bayes is used when the dataset consists of numerical features. It assumes that the features follow a Gaussian (normal) distribution. Gaussian Naive Bayes is suitable for continuous data. (It's not commonly used for text classification where features are usually discrete.)

2. Categorical Naive Bayes - When the dataset contains categorical features, such as colors or types of objects, we use Categorical Naive Bayes. It assumes that each feature follows a categorical distribution. (It's not commonly used for text classification unless the text features are converted into categorical variables.)

3. Bernoulli Naive Bayes - Bernoulli Naive Bayes is applied when the features are binary or follow a Bernoulli distribution. This type of Naive Bayes is suitable for datasets with binary features like presence or absence of certain attributes. (It's commonly used for text classification tasks, especially when representing text data using binary term presence/absence vectors.)

4. Multinomial Naive Bayes - Multinomial Naive Bayes is commonly used for text classification tasks. It assumes that features represent the frequencies or occurrences of different words in the text. It is suitable for datasets with discrete features and follows a multinomial distribution. (It's one of the MOST COMMONLY used classifiers for text classification tasks.)

    ["naivebayes"]
    multinomial_naive_bayes(x, y, prior = NULL, laplace = 0.5, ...)
        Note: x has to be a Matrix.
        Note: y has to be a class vector (character/factor/logical).
        Note: laplace has the default number of 0.5.

5. Complement Naive Bayes -  Complement Naive Bayes is a variation of Naive Bayes that is designed to address imbalanced datasets. It is particularly useful when the majority class overwhelms the minority class in the dataset. It aims to correct the imbalance by considering the complement of each class when making predictions. (It's useful when dealing with imbalanced text classification datasets.)

-->

```{r echo = FALSE, warning = FALSE}
library(naivebayes)
```

<br>
<font size="+0" color="black"><strong><u>Training</u> a Multinomial Naive Bayes model on the training data:</strong></font>
```{r}
# Train the Multinomial Naïve Bayes model on the training data, using Laplace smoothing of 1.
multinomial.naive.bayes.model <- multinomial_naive_bayes(as.matrix(dtm), as.factor(training$Labels), laplace = 1)
```
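
For readers who want to see what a multinomial Naive Bayes classifier computes, here is a textbook-style sketch in base R with additive (Laplace) smoothing. It is an illustration on made-up counts, not the internals of the `naivebayes` package; `x`, `y`, `new.doc`, and the term names are hypothetical.

```r
# Toy training counts: rows are documents, columns are terms (hypothetical).
x <- matrix(c(3, 0, 1,
              2, 1, 0,
              0, 3, 1,
              0, 2, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("good", "bad", "store")))
y <- factor(c("pos", "pos", "neg", "neg"))
laplace <- 1

# Log prior: share of training documents in each class.
log.prior <- log(table(y) / length(y))

# Per-class term totals with additive smoothing, normalized so that each
# class's term probabilities sum to 1.
class.counts <- rowsum(x, group = y) + laplace
log.lik <- log(class.counts / rowSums(class.counts))

# Score a new document: log prior plus count-weighted log likelihoods.
new.doc <- c(good = 2, bad = 0, store = 1)
scores <- as.numeric(log.prior) + as.numeric(log.lik %*% new.doc)
names(scores) <- levels(y)

# The predicted class is the one with the highest score.
names(which.max(scores))
```

Smoothing keeps a term that never appears in a class from forcing that class's probability to zero, which is the reason a non-zero `laplace` value is passed when training.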

<br>
<font size="+0" color="black"><strong>Getting the <u>predictions</u> of the model on the testing data:</strong></font>
```{r}
# Getting the predictions of the Naïve Bayes model on the testing data.
multinomial.naive.bayes.predictions <- predict(multinomial.naive.bayes.model, newdata = as.matrix(test_dtm))
```

<br>
<font size="+0" color="black"><strong>Displaying the <u>accuracy</u> of the model:</strong></font>
```{r echo = FALSE}
# Evaluating model accuracy
#
# Storing the number of predictions made by the Naïve Bayes model.
total_predictions <- length(multinomial.naive.bayes.predictions)

# Calculating the amount of correct predictions made by the Naïve Bayes model.
correct_predictions <- sum(multinomial.naive.bayes.predictions == testing$Labels)

# Obtaining the accuracy of the Naïve Bayes model's predictions.
naive.accuracy <- (correct_predictions / total_predictions) * 100

# Displaying the accuracy
cat("Accuracy: ", round(naive.accuracy, 2), "%", sep = "")
```

<br>
<font size="+0" color="black"><strong>Showing the count of the <u>incorrectly predicted labels</u> of the model:</strong></font>
```{r echo = FALSE}
incorrect.predictions <- sum(testing$Labels != multinomial.naive.bayes.predictions)

cat("Number of incorrect predictions:", incorrect.predictions)
```

<br>
<font size="+0" color="black"><strong>Displaying a list of the <u>actual labels</u> next to the <u>incorrectly predicted labels</u> of the model:</strong></font>
```{r echo = FALSE}
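# A matrix to hold all incorrect predictions.
incorrect.values.matrix <- matrix(NA, nrow = incorrect.predictions, ncol = 2)

# Predictions are a factor; convert them to their label values so the matrix
# stores the labels (0, 1, 2) rather than the factor's internal codes.
predicted.labels <- as.integer(as.character(multinomial.naive.bayes.predictions))

counter <- 0
for (i in seq_along(testing$Labels)){
    if(testing$Labels[i] != predicted.labels[i]){
        counter <- counter + 1
        incorrect.values.matrix[counter,1] <- testing$Labels[i]
        # Store the prediction for document i (not the first prediction).
        incorrect.values.matrix[counter,2] <- predicted.labels[i]
    }
}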

# Convert the matrix to a data frame
incorrectValues.df <- as.data.frame(incorrect.values.matrix)

# Set column names
colnames(incorrectValues.df) <- c("Actual", "Predicted")

incorrectValues.df
```

<br>
<font size="+0" color="black"><strong>Presenting a frequency table of the <u>actual labels</u> of the <u>Testing</u> data set:</strong></font>
```{r echo = FALSE}
cat("Actual labels table:\n")
table(testing$Labels)
```

<br>
<font size="+0" color="black"><strong>Showing a frequency table of the <u>predicted labels</u> of the model:</strong></font>
```{r echo = FALSE}
cat("Predicted labels table:\n\n")
table(multinomial.naive.bayes.predictions)
```

<br>
<font size="+0" color="black"><strong>Displaying a <u>Confusion Matrix</u> of the <u>actual labels</u> of the <u>Testing</u> data set vs the <u>predicted labels</u> of the model:</strong></font>
```{r echo = FALSE}
# Note: if the predictions end up being of only one class, your Confusion Matrix will show only
# one column. The same can happen with the rows.
#
# Rows represent the levels or categories of the variable testing$Labels.
# Columns represent the levels or categories of the variable multinomial.naive.bayes.predictions
#
conf_matrix <- table(testing$Labels, multinomial.naive.bayes.predictions)

# Sets the custom names for rows and columns
custom_names <- c("(Pred Neg) 0", "(Pred Neu) 1", "(Pred Pos) 2")
colnames(conf_matrix) <- custom_names

custom_names <- c("(True Neg) 0", "(True Neu) 1", "(True Pos) 2")
rownames(conf_matrix) <- custom_names

cat("Confusion Matrix:\n\n")
print(conf_matrix)
# TRUE-Negative     False-Neutral     False-Positive
# False-Negative    TRUE-Neutral      False-Positive
# False-Negative    False-Neutral     TRUE-Positive

```
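
The link between a confusion matrix and accuracy can be shown on a small made-up example. In this sketch the vectors `actual` and `predicted` are hypothetical, and the factor levels are fixed in advance so the table stays 3 x 3 even when a class is never predicted.

```r
# Toy actual and predicted labels (hypothetical), coded 0 = Negative,
# 1 = Neutral, 2 = Positive. No document is predicted as class 1.
actual    <- c(0, 0, 1, 2, 2, 2)
predicted <- c(0, 2, 0, 2, 2, 0)

# Fixing the levels keeps the table 3 x 3 even when a class is never predicted.
lv <- 0:2
toy.conf <- table(Actual = factor(actual, levels = lv),
                  Predicted = factor(predicted, levels = lv))
toy.conf

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(toy.conf)) / sum(toy.conf) * 100
accuracy

# Recall per class: diagonal divided by the row totals.
diag(toy.conf) / rowSums(toy.conf)
```

Reading the off-diagonal cells row by row shows which actual class was mistaken for which predicted class.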

<br>
<font size="+0" color="black"><strong>Presenting a <u>Proportion Confusion Matrix</u> showing each combination of <u>actual labels</u> from the <u>Testing</u> data set and <u>predicted labels</u> generated by the model as a proportion of all predictions:</strong></font>
```{r echo = FALSE}
# Note: if the predictions end up being of only one class, your Confusion Matrix will show only
# one column. The same can happen with the rows.
#
# Rows represent the levels or categories of the variable testing$Labels.
# Columns represent the levels or categories of the variable multinomial.naive.bayes.predictions
#
conf_matrix <- prop.table(table(testing$Labels, multinomial.naive.bayes.predictions))

# Sets the custom names for rows and columns
custom_names <- c("(Pred Neg) 0", "(Pred Neu) 1", "(Pred Pos) 2")
colnames(conf_matrix) <- custom_names

custom_names <- c("(True Neg) 0", "(True Neu) 1", "(True Pos) 2")
rownames(conf_matrix) <- custom_names

cat("Proportion Confusion Matrix:\n\n")
print(conf_matrix)
# TRUE-Negative     False-Neutral     False-Positive
# False-Negative    TRUE-Neutral      False-Positive
# False-Negative    False-Neutral     TRUE-Positive

```

<div style="color:black;background-color:lightyellow">
   <br>
   <font size="+0.3"><span style="color: darkblue">Based on the Confusion Matrices, <strong>2450</strong> predictions were <strong>correctly labeled</strong>, which is <strong>64.52%</strong> of all predictions. These figures agree with the results obtained earlier: subtracting 2450 from the total number of predictions (3797) gives 1347, the number of incorrect predictions calculated previously, and the percentage derived from the Confusion Matrices matches the model's accuracy (before rounding).</span></font>
   <br>
   <br>
   <font size="+0.3"><span style="color: darkblue">Among the incorrect predictions, the model most often <strong>confused Negative and Positive values</strong>. Specifically, 333 Negative values were labeled as Positive (approximately 8.77% of all predictions), closely followed by 326 Positive values labeled as Negative (around 8.59% of all predictions).</span></font>
   <br>
   <br>
   <font size="+0.3"><span style="color: darkblue"><strong>Note:</strong> Use the following table to interpret the Confusion Matrices (rows are actual labels, columns are predicted labels).</span></font>
   <br>
   <br>
   <font size="+0.3"><span style="color: darkblue">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;True-Negative&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;False-Neutral&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;False-Positive</span></font>
   <br>
   <font size="+0.3"><span style="color: darkblue">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;False-Negative&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;True-Neutral&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;False-Positive</span></font>
   <br>
   <font size="+0.3"><span style="color: darkblue">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;False-Negative&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;False-Neutral&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;True-Positive</span></font>
</div>
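
The figures quoted in the summary can be re-derived with a few lines of arithmetic. The sketch below only uses the numbers stated there; the variable names are hypothetical.

```r
# Figures quoted in the summary.
total.predictions   <- 3797
correct.predictions <- 2450
neg.as.pos          <- 333
pos.as.neg          <- 326

# Incorrect predictions and accuracy.
total.predictions - correct.predictions                   # 1347
round(correct.predictions / total.predictions * 100, 2)   # 64.52

# Share of all predictions taken by the two largest error cells.
round(neg.as.pos / total.predictions * 100, 2)            # 8.77
round(pos.as.neg / total.predictions * 100, 2)            # 8.59
```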