So in terms of predictors, there are a lot of options with cloud tie-ins, but for this specific class I wanted to try to implement an entirely on-prem solution.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tm)
## Warning: package 'tm' was built under R version 4.1.3
## Loading required package: NLP
library(SnowballC)
First things first, we’re going to grab our sample data from Kaggle (https://www.kaggle.com/datasets/kazanova/sentiment140).
Then we’re going to read in the full file and keep only 10,000 rows, as the entire set is massive (1.6 million tweets). Rather than just taking the first 10,000 rows, we sample_n them so we get a reasonably random, representative subset.
raw <- read_csv("source_data.csv", col_names=c("Sentiment", "id", "dt", "status", "User", "tweet"))
## Rows: 1600000 Columns: 6
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): dt, status, User, tweet
## dbl (2): Sentiment, id
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
raw <- sample_n(raw, 10000)
raw
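One thing worth noting: sample_n draws a different 10,000 rows every time the document is knit. If you want the run to be reproducible, you can seed the random number generator before sampling. A minimal sketch (the seed value 42 is arbitrary):
# fix the RNG so sample_n() returns the same 10,000 rows on every knit
set.seed(42)
raw <- sample_n(raw, 10000)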
corpus = Corpus(VectorSource(raw$tweet))
corpus[[1]][1]
## $content
## [1] "@PsychedelicBabe oh yeah, I'm listening to AD(lightning bolt slash)DC [[AC/DC]], I got all their CD as a gift"
Like any good data pipeline, first you read the data, then you subset it (if needed), then you clean the data and finally process it. Here we’re removing punctuation, converting non-UTF-8 characters to byte escapes, dropping English stopwords, and finally stemming each word to its root (i.e., decreasing variability in the vocabulary).
#my_stopwords <- c("á","€")
#corpus <- tm_map(corpus, removeWords, my_stopwords)
corpus = tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
corpus = tm_map(corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
## Warning in tm_map.SimpleCorpus(corpus, function(x) iconv(enc2utf8(x), sub =
## "byte")): transformation drops documents
corpus <- tm_map(corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents
corpus = tm_map(corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents
corpus[[1]][1]
## $content
## [1] "PsychedelicBab oh yeah Im listen ADlightn bolt slashDC ACDC I got CD gift"
At this point we’re going to create a document-term matrix (DTM) from the corpus.
frequencies = DocumentTermMatrix(corpus)
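If you want a feel for what ended up in the DTM before trimming it, tm and base R give you a couple of quick inspection helpers; for example:
# dimensions of the full DTM (documents x terms)
dim(frequencies)
# terms that show up at least 50 times across the sample
findFreqTerms(frequencies, lowfreq = 50)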
Then we’re going to remove sparse terms, i.e. terms that occur in very few documents. With a sparsity threshold of 0.995, only terms appearing in at least roughly 0.5% of the documents are kept.
reduced = removeSparseTerms(frequencies, 0.995)
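The 0.995 threshold is a judgment call; one way to see how aggressive it was is to compare the vocabulary size before and after:
# number of terms kept vs. the original vocabulary size
ncol(reduced)
ncol(frequencies)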
Now we’re going to convert the DTM into a plain data frame.
reducedDf = as.data.frame(as.matrix(reduced))
colnames(reducedDf) = make.names(colnames(reducedDf))
And after cleaning up the column names with make.names, here is what that data frame looks like.
reducedDf
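From here, the usual next step is to bolt the sentiment label back onto the features so this data frame can feed a classifier. Since Corpus(VectorSource(raw$tweet)) preserves row order, the rows of reducedDf still line up with raw; a hedged sketch:
# re-attach the label (rows of reducedDf are in the same order as raw)
reducedDf$Sentiment <- as.factor(raw$Sentiment)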