The Library read in

So in terms of predictors, there are a lot of options with cloud tie-ins, but for this specific class, I wanted to try to implement an entirely on-prem solution.

library(readr)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tm)
## Warning: package 'tm' was built under R version 4.1.3
## Loading required package: NLP
library(SnowballC)

Data Read in

First things first, we’re going to grab our sample data from Kaggle: the Sentiment140 dataset (https://www.kaggle.com/datasets/kazanova/sentiment140).

Then we are going to read in the data and sample it down to 10,000 rows, as the entire set is massive (1.6 million tweets). Using sample_n gives us a reasonably random, representative subset.

raw <- read_csv("source_data.csv", col_names=c("Sentiment", "id", "dt", "status", "User", "tweet"))
## Rows: 1600000 Columns: 6
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): dt, status, User, tweet
## dbl (2): Sentiment, id
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
raw <- sample_n(raw, 10000)
raw
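
One note: sample_n draws a random sample, so the exact rows (and every number below) will differ between runs. If reproducibility matters, the sampling step could be seeded instead, e.g. (an optional alternative to the call above, not used in this run):

# Reproducible alternative to the sampling call above (not used here):
# set.seed(100)
# raw <- sample_n(raw, 10000)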

Corpus creation

At this point, we will create a corpus from the representative sample of data.

corpus = Corpus(VectorSource(raw$tweet))
corpus[[1]][1]
## $content
## [1] "@PsychedelicBabe oh yeah, I'm listening to AD(lightning bolt slash)DC [[AC/DC]], I got all their CD as a gift"

Corpus Cleaning

Like any good data pipeline, first you read the data, then you subset it (if needed), then you clean it and finally process it. Here, we’re removing punctuation, converting non-UTF-8 characters to byte escapes, removing English stopwords, and finally stemming every word down to its root (i.e., decreasing variability across word forms).

#my_stopwords <- c("á","€")
#corpus <- tm_map(corpus, removeWords, my_stopwords)

corpus = tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
corpus = tm_map(corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
## Warning in tm_map.SimpleCorpus(corpus, function(x) iconv(enc2utf8(x), sub =
## "byte")): transformation drops documents
corpus <- tm_map(corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents
corpus = tm_map(corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents
corpus[[1]][1]  
## $content
## [1] "PsychedelicBab oh yeah Im listen ADlightn bolt slashDC ACDC I got CD gift"

Document Term Matrix

DTM Creation

At this point we’re going to create a document-term matrix (DTM) from the corpus.

frequencies = DocumentTermMatrix(corpus)

DTM Cleaning

Then we are going to remove sparse terms, i.e. terms that appear in fewer than roughly 0.5% of the tweets.

reduced = removeSparseTerms(frequencies, 0.995)
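
As a quick sanity check (optional, output not shown in the original run), printing a DocumentTermMatrix reports its dimensions and sparsity, so we can see how many terms the 0.995 cutoff kept.

# Optional check: compare the DTM before and after dropping sparse terms.
frequencies   # full DTM: number of documents, terms, and sparsity
reduced       # pruned DTM: far fewer terms remain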

Now we are going to convert the DTM into a regular data frame and make the column names syntactically valid.

reducedDf = as.data.frame(as.matrix(reduced))

colnames(reducedDf) = make.names(colnames(reducedDf))

And after some prettifying, here is what that data frame looks like.

reducedDf

Baseline accuracy

Next, we augment the sentiment labels back in (0 = negative, 4 = positive), so that we can later see how the frequently occurring words relate to sentiment, and check the baseline class distribution.

reducedDf$recommended_id = raw$Sentiment
prop.table(table(reducedDf$recommended_id)) 
## 
##      0      4 
## 0.5041 0.4959
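
For reference, a trivial classifier that always predicts the more common label would be right just over half the time; that proportion is the baseline accuracy the model needs to beat.

# Majority-class baseline: always predicting the more common label (about 0.50 here)
max(prop.table(table(reducedDf$recommended_id)))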

Test/Train split

Now that we have our data with sentiment labels attached, we’re going to split it into training (70%) and test (30%) sets.

library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
set.seed(100)

split = sample.split(reducedDf$recommended_id, SplitRatio = 0.7)

train = subset(reducedDf, split==TRUE)

test = subset(reducedDf, split==FALSE)
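
Because sample.split stratifies on the label vector, both subsets should keep roughly the same 50/50 balance as the full sample; this can be checked directly (optional, not part of the original run).

# Optional check: class balance should be preserved in both subsets.
prop.table(table(train$recommended_id))
prop.table(table(test$recommended_id))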

Random Forest

We’re going to use a random forest as the classifier, as it is a solid general-purpose approach.

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
set.seed(100)

train$recommended_id = as.factor(train$recommended_id)

test$recommended_id = as.factor(test$recommended_id)

RF_model = randomForest(recommended_id ~ ., data=train)

predictRF = predict(RF_model, newdata=test)

classifier_result = table(test$recommended_id, predictRF)
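
Since the goal stated earlier was to see how frequent words relate to sentiment, the fitted forest’s variable importance can also be inspected; this is an optional sketch, not part of the original run.

# Optional: which terms contribute most to the forest's splits
# (mean decrease in Gini impurity, from the randomForest package).
head(sort(importance(RF_model)[, 1], decreasing = TRUE), 20)
varImpPlot(RF_model)  # same information as a plot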

Metrics

So I wanted to see how effective my classifier was. First things first, in our sample we had roughly a 50/50 split in the distribution of positive vs. negative tweets, so always guessing the majority class would only be about 50% accurate. Our classifier’s accuracy came in around 20 percentage points above that baseline (roughly 69%), which is better than chance but not really that good.

So first I looked at precision and recall, which are in line with the overall accuracy and don’t reveal problems predicting one specific class over the other.

Then I moved to Cohen’s kappa, a metric that compares the observed agreement between predictions and labels against the agreement expected by chance. The lower the kappa value, the closer the classifier’s performance is to random guessing. The kappa value for this set is fairly low, at about 0.38.

classifier_result = table(test$recommended_id, predictRF)
n = sum(classifier_result)                  # number of instances
nc = nrow(classifier_result)                # number of classes
diag = diag(classifier_result)              # number of correctly classified instances per class
rowsums = apply(classifier_result, 1, sum)  # number of instances per class
colsums = apply(classifier_result, 2, sum)  # number of predictions per class
p = rowsums / n                             # distribution of instances over the actual classes
q = colsums / n                             # distribution of instances over the predicted classes

accuracy = sum(diag) / n
accuracy
## [1] 0.6896667
precision = diag / colsums
recall = diag / rowsums
f1 = 2 * precision * recall / (precision + recall)

data.frame(precision, recall, f1)

expAccuracy = sum(p * q)
kappa = (accuracy - expAccuracy) / (1 - expAccuracy)

kappa
## [1] 0.3793102
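
As a cross-check, the same confusion matrix, accuracy, and kappa can be produced in one call with the caret package (an assumption here is that caret is installed; this was not part of the run above).

# Optional cross-check with caret: confusionMatrix() reports the confusion
# table, accuracy, and kappa together.
library(caret)
confusionMatrix(data = predictRF, reference = test$recommended_id)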

Conclusion

In conclusion, this dataset did a fantastic job of demonstrating how a classifier can work, but the model really needs more tuning to reach a higher degree of accuracy. The low kappa score indicates that tweet sentiment is hard to gauge consistently with this methodology, and a different approach may work better.

References

Data: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.