Some of the packages that we will use in R to work with text data include tm (preprocessing and transformation), textstem (lemmatization), wordcloud (visualization) and lexicon (sentiment lexicons).

If you do not already have these packages installed, you will need to install them.

install.packages(c("tm", "textstem", "wordcloud", "lexicon"))

To load the packages, we use the library() function.

library(tm)
library(lexicon)
library(wordcloud)
library(textstem)

Data

The data that we will use is a sample of a larger data set containing online reviews for clothing items.

The variables included in the dataset are:

  • ID_No
  • Clothing.ID
  • Age
  • Title
  • Review.Text
  • Rating
  • Recommended.IND
  • Positive.Feedback.Count
  • Division.Name
  • Department.Name
  • Class.Name

Data Cleansing & Exploration

str(cr)
## 'data.frame':    4697 obs. of  11 variables:
##  $ ID_No                  : int  14099 7585 5974 2502 9802 19960 8376 2526 20063 9227 ...
##  $ Clothing.ID            : int  686 868 839 1078 583 1141 1092 1078 1081 927 ...
##  $ Age                    : int  28 38 56 50 26 53 35 39 65 59 ...
##  $ Title                  : chr  "Just what i was looking for!" "Love" "Darling summer top" "Great with leggings" ...
##  $ Review.Text            : chr  "This is the perfect lounge/sleep cami! i am 5'1'' 120 lbs and purchased the dark purple small. it's hangs a bit"| __truncated__ "This top is beautiful and i'm in love..but.. it'd huuuuggge! i'm always a large in retailer tops. im going to r"| __truncated__ "I usually try to avoid paying full price for items but saw this in the store and couldn't pass it up! tried it "| __truncated__ "This dress is too short for me to wear as a dress, at 5'7\" and over 40. however is is great with leggings and "| __truncated__ ...
##  $ Rating                 : int  5 5 5 4 5 5 2 5 3 5 ...
##  $ Recommended.IND        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Positive.Feedback.Count: int  0 0 1 13 1 1 0 24 0 0 ...
##  $ Division.Name          : chr  "Initmates" "General" "General" "General Petite" ...
##  $ Department.Name        : chr  "Intimate" "Tops" "Tops" "Dresses" ...
##  $ Class.Name             : chr  "Sleep" "Knits" "Blouses" "Dresses" ...

Empty documents contain no text data. Missing values in character variables are imported by default into R as "". Check for and remove any empty documents that contain NA, "" or a space character, " ".

any(is.na(cr$Review.Text))
## [1] FALSE
any(cr$Review.Text == " " | cr$Review.Text == "")
## [1] TRUE
cr <- cr[cr$Review.Text != "" & cr$Review.Text != " ",]

Missing numeric data is imported into R as NA by default. Check for missing ratings.

any(is.na(cr$Rating))
## [1] FALSE

Online product reviews are known to have a J-shaped distribution, and are heavily biased towards positive ratings. (Read more about this in the article ‘Why do Online Product Reviews have a J-Shaped Distribution? Overcoming Biases in Online Word-of-Mouth Communications’.)

barplot(table(cr$Rating), 
        main = "Rating Distribution",
        xlab = "Rating",
        ylab = "Frequency")

Identify and explore potential factor variables.

facs <- c("Division.Name", "Department.Name", "Class.Name", "Rating", "Clothing.ID", "Recommended.IND")
lapply(cr[facs], function(x) nlevels(factor(x)))
## $Division.Name
## [1] 4
## 
## $Department.Name
## [1] 7
## 
## $Class.Name
## [1] 19
## 
## $Rating
## [1] 5
## 
## $Clothing.ID
## [1] 595
## 
## $Recommended.IND
## [1] 2
facs <- facs[!facs %in% "Clothing.ID"]
lapply(cr[facs], table)
## $Division.Name
## 
##                       General General Petite      Initmates 
##              1           2624           1579            304 
## 
## $Department.Name
## 
##           Bottoms  Dresses Intimate  Jackets     Tops    Trend 
##        1      713     1194      349      190     2037       24 
## 
## $Class.Name
## 
##               Blouses    Dresses Fine gauge  Intimates    Jackets      Jeans 
##          1        618       1194        204         33        124        231 
##      Knits   Layering    Legwear     Lounge  Outerwear      Pants     Shorts 
##        945         35         33        136         66        247         61 
##     Skirts      Sleep   Sweaters       Swim      Trend 
##        174         40        270         72         24 
## 
## $Rating
## 
##    1    2    3    4    5 
##  161  308  549  976 2514 
## 
## $Recommended.IND
## 
##    0    1 
##  792 3716

Convert values of "" to NA for the categorical variables. Then, convert to factors.

cr[facs] <- lapply(cr[facs], function(x)  replace(x, x %in% "", NA))
cr[facs] <- lapply(cr[facs], factor)

Explore descriptive statistics and distribution information for numerical variables.

nums <- c("Age", "Positive.Feedback.Count")
summary(cr[nums])
##       Age        Positive.Feedback.Count
##  Min.   :19.00   Min.   : 0.000         
##  1st Qu.:34.00   1st Qu.: 0.000         
##  Median :41.00   Median : 1.000         
##  Mean   :43.22   Mean   : 2.738         
##  3rd Qu.:52.00   3rd Qu.: 3.000         
##  Max.   :91.00   Max.   :98.000
hist(cr$Age, xlab = "", main = "Age")
hist(cr$Positive.Feedback.Count, xlab="", main = "Positive Feedback")


Text Data Concepts: A Review

Some important text data concepts include:

  • documents and corpora (document collections)
  • tokens and terms (the vocabulary, or dictionary)
  • stop words
  • stemming and lemmatization
  • the Bag of Words assumption
  • document-term (and term-document) representations and their weighting (e.g., tf-idf)

Text Data Preprocessing

We use the tm package to preprocess and transform text data.

The Corpus() function is used to create a corpus from a document collection stored in a vector (VectorSource()), dataframe (DataframeSource()) or directory (DirSource()). To use a dataframe source to create a Corpus, the dataframe must have the text data in a variable named “text” and each document must have a unique document ID variable named “doc_id”.

The Title and Review.Text variables contain text data, so we will combine them together to create a new variable named text, separating them with a space (" ").

names(cr)[1] <- "doc_id" # Rename ID_No to doc_id
cr$text <- paste(cr$Title, cr$Review.Text, sep = " ")

We convert our dataframe cr to a Corpus from a dataframe source. A Corpus object is a named list object (names = doc_id). Each list item contains content (text) and document-level metadata (meta). The corpus-level metadata (all other variables in the dataframe) can be accessed using the meta() function. The inspect() function can be used to view an overview of individual documents, or indexing can be used.

crCorpus <- Corpus(DataframeSource(cr))
inspect(crCorpus[[1]]) # or crCorpus[[1]]$content
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 274
## 
## Just what i was looking for! This is the perfect lounge/sleep cami! i am 5'1'' 120 lbs and purchased the dark purple small. it's hangs a bit loose - my preferred fit for sleeping shirts. good quality, easy washing (machine/delicate), and cool and airy for hot summer nights.

Note: The document IDs are converted to character values. crCorpus[[1]] refers to the document in the first index position. The doc_id of the document in the first index position is 14099.

We can use the getTransformations() function to identify what transformations are available to us in the tm package.

getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"

We use the tm_map() function to apply transformations to each of the documents in our corpus.


Case Conversion

Case conversion is sometimes referred to as case folding. We apply the tolower() transformation to convert all letters to lowercase.

crCorpus <- tm_map(crCorpus, tolower) 
inspect(crCorpus[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 274
## 
## just what i was looking for! this is the perfect lounge/sleep cami! i am 5'1'' 120 lbs and purchased the dark purple small. it's hangs a bit loose - my preferred fit for sleeping shirts. good quality, easy washing (machine/delicate), and cool and airy for hot summer nights.

Number removal

We apply removeNumbers() to remove numbers.

crCorpus <- tm_map(crCorpus, removeNumbers)
inspect(crCorpus[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 269
## 
## just what i was looking for! this is the perfect lounge/sleep cami! i am '''  lbs and purchased the dark purple small. it's hangs a bit loose - my preferred fit for sleeping shirts. good quality, easy washing (machine/delicate), and cool and airy for hot summer nights.

We can handle some punctuation now, and then make decisions about all punctuation based on the type of analysis we plan to do.

punc2Space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
crCorpus <- tm_map(crCorpus, punc2Space, "/")
inspect(crCorpus[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 269
## 
## just what i was looking for! this is the perfect lounge sleep cami! i am '''  lbs and purchased the dark purple small. it's hangs a bit loose - my preferred fit for sleeping shirts. good quality, easy washing (machine delicate), and cool and airy for hot summer nights.

A Detour on Zipf’s and Heaps’ Laws

Let’s create a duplicate Corpus that removes punctuation, as a demonstration to motivate the origin and use of stop words. We use removePunctuation() to remove punctuation.

crCorpus_dupe <- tm_map(crCorpus, removePunctuation)
inspect(crCorpus_dupe[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 255
## 
## just what i was looking for this is the perfect lounge sleep cami i am   lbs and purchased the dark purple small its hangs a bit loose  my preferred fit for sleeping shirts good quality easy washing machine delicate and cool and airy for hot summer nights

Let’s view the ten most frequent terms and their token counts.

head(sort(slam::col_sums(DocumentTermMatrix(crCorpus_dupe)), decreasing = TRUE), n=10)
##   the   and  this   but   for  with   was dress  love   not 
## 15512 10165  5413  3530  3067  2600  2413  2399  2160  2154

Term frequency does not equate to term importance. Why not?


Heaps’ Law

Heaps’ Law states that the number of terms, \(V\), in a document collection that contains \(N\) tokens is approximately \(\sqrt{N}\). More specifically, vocabulary/dictionary size is a function of the number of tokens in the document collection: more documents -> more terms. If Heaps’ Law holds, we should expect the slope of the fitted line (in log-log space) to be approximately 0.5.

Heaps_plot(DocumentTermMatrix(crCorpus_dupe))

## (Intercept)           x 
##   2.6042551   0.5207744

Heaps’ Law, which typically holds, underscores the importance of dictionary compression: preprocessing steps can reduce the growth rate of the vocabulary.
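As a rough, back-of-the-envelope check of the fitted parameters reported above (assuming, as is typical, that the fit is on natural-log scales), we can compare the vocabulary size predicted by \(V \approx e^{2.60} N^{0.52}\) with the observed vocabulary size. The object name dtm_dupe below is introduced here just for this check.

# Rough check of the fitted Heaps' Law parameters (values taken from the output above)
dtm_dupe <- DocumentTermMatrix(crCorpus_dupe)
N <- sum(slam::col_sums(dtm_dupe))      # total number of tokens in the collection
V_hat <- exp(2.6042551) * N^0.5207744   # vocabulary size predicted by the fitted line
c(predicted = round(V_hat), observed = nTerms(dtm_dupe))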


Zipf’s Law

Zipf’s Law models the distribution of terms (unique tokens) in a document collection. It states that the frequency of any term is inversely proportional to its frequency rank. If \(t_1\) is the most common term, \(t_2\) is the second most common term, etc., then the collection (global) frequency \(cf_i\) of the \(i\)th most common term is proportional to \(1/i\). If Zipf’s law holds, we expect a line with a slope of -1.

Zipf_plot(DocumentTermMatrix(crCorpus_dupe))

## (Intercept)           x 
##   13.136256   -1.507185

Zipf’s Law has been shown to hold in many document collections and in many languages. Zipf’s Law motivates the use of stop word lists in text analysis.
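As a complementary sanity check on the idealized relationship \(cf_i \propto cf_1 / i\), we can compare the observed frequencies of the top terms to the values predicted from the most frequent term (a sketch using only objects already created above):

# Compare observed term frequencies to the idealized Zipf prediction cf_1 / i
cf <- sort(slam::col_sums(DocumentTermMatrix(crCorpus_dupe)), decreasing = TRUE)
head(cbind(observed = cf, zipf_predicted = round(cf[1] / seq_along(cf))), n = 10)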

Intraword punctuation should be preserved until after stop word removal. Why?

The stopwords() function can be used to view and use stop word lists. The two that we will consider are en (English) and SMART.

length(stopwords("en"))
## [1] 174
length(stopwords("SMART"))
## [1] 571
# Words in en list that are not in SMART
stopwords("en")[!stopwords("en") %in% stopwords("SMART")]
##  [1] "she's"   "he'd"    "she'd"   "he'll"   "she'll"  "shan't"  "mustn't"
##  [8] "when's"  "why's"   "how's"
# Common Words in both
stopwords("en")[stopwords("en") %in% stopwords("SMART")]
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "it's"       "we're"      "they're"    "i've"       "you've"    
##  [66] "we've"      "they've"    "i'd"        "you'd"      "we'd"      
##  [71] "they'd"     "i'll"       "you'll"     "we'll"      "they'll"   
##  [76] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [81] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [86] "won't"      "wouldn't"   "shouldn't"  "can't"      "cannot"    
##  [91] "couldn't"   "let's"      "that's"     "who's"      "what's"    
##  [96] "here's"     "there's"    "where's"    "a"          "an"        
## [101] "the"        "and"        "but"        "if"         "or"        
## [106] "because"    "as"         "until"      "while"      "of"        
## [111] "at"         "by"         "for"        "with"       "about"     
## [116] "against"    "between"    "into"       "through"    "during"    
## [121] "before"     "after"      "above"      "below"      "to"        
## [126] "from"       "up"         "down"       "in"         "out"       
## [131] "on"         "off"        "over"       "under"      "again"     
## [136] "further"    "then"       "once"       "here"       "there"     
## [141] "when"       "where"      "why"        "how"        "all"       
## [146] "any"        "both"       "each"       "few"        "more"      
## [151] "most"       "other"      "some"       "such"       "no"        
## [156] "nor"        "not"        "only"       "own"        "same"      
## [161] "so"         "than"       "too"        "very"

For Sentiment Analysis, stop words should not be removed. Why?

Remove “en” stop words and save the result as a separate Corpus for non-sentiment analysis.

crCorpus_no_SA <- tm_map(crCorpus, content_transformer(function(x) removeWords(x, stopwords("en"))))
inspect(crCorpus_no_SA[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 221
## 
## just    looking !    perfect lounge sleep cami!   '''  lbs  purchased  dark purple small.  hangs  bit loose -  preferred fit  sleeping shirts. good quality, easy washing (machine delicate),  cool  airy  hot summer nights.

Custom stop words can be removed by passing removeWords() a vector (c()) that combines stopwords() with the custom words, as sketched below. In addition to removing custom stop words, misspelled words can be replaced using the replace_misspelling() function in the textclean package, which relies on the popular Hunspell spell checker and morphological analyzer.
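For example, a minimal sketch (the additional words here are purely illustrative choices, not recommendations for this dataset):

# Remove the standard "en" stop words plus a few hypothetical custom stop words
customStops <- c(stopwords("en"), "retailer", "xs")  # illustrative additions only
crCorpus_custom <- tm_map(crCorpus_no_SA,
                          content_transformer(function(x) removeWords(x, customStops)))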


Punctuation

The removePunctuation() function is used to remove punctuation and special characters from the text documents. By default, the characters removed are ASCII (ucp = FALSE) (! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ { | } ~), but can be changed to unicode (ucp = TRUE). The arguments preserve_intra_word_contractions (“she’s”, “dog’s”) and preserve_intra_word_dashes (“high-level”, “x-small”) default to FALSE.
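A quick sketch on a standalone (made-up) string illustrates the effect of these arguments:

# Defaults remove apostrophes and dashes entirely
removePunctuation("she's a high-level, x-small fan!")
# Preserving intra-word contractions and dashes keeps she's, high-level and x-small intact
removePunctuation("she's a high-level, x-small fan!",
                  preserve_intra_word_contractions = TRUE,
                  preserve_intra_word_dashes = TRUE)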

Corpus for non-sentiment analysis

crCorpus_no_SA <- tm_map(crCorpus_no_SA,
                         removePunctuation,
                         preserve_intra_word_contractions = FALSE,
                         preserve_intra_word_dashes = TRUE)
inspect(crCorpus_no_SA[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 208
## 
## just    looking     perfect lounge sleep cami     lbs  purchased  dark purple small  hangs  bit loose   preferred fit  sleeping shirts good quality easy washing machine delicate  cool  airy  hot summer nights

Corpus for sentiment analysis

crCorpus <- tm_map(crCorpus, 
                         removePunctuation, 
                         preserve_intra_word_contractions = FALSE, 
                         preserve_intra_word_dashes = FALSE)
inspect(crCorpus[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 255
## 
## just what i was looking for this is the perfect lounge sleep cami i am   lbs and purchased the dark purple small its hangs a bit loose  my preferred fit for sleeping shirts good quality easy washing machine delicate and cool and airy for hot summer nights

If the text data is obtained from social media, initial cleansing may be needed to remove links, hashtags, retweets, and social media handles before more general punctuation is handled. More information about regular expressions is available in R's documentation (see ?regex).

# remove links
crCorpus <- tm::tm_map(crCorpus, tm::content_transformer(function(x) gsub("http[^[:space:]]*", " ", x)))
# remove retweets
crCorpus <- tm::tm_map(crCorpus, tm::content_transformer(function(x) gsub('\\b+RT', " ", x)))
# remove mentions
crCorpus <- tm::tm_map(crCorpus, tm::content_transformer(function(x) gsub('@\\S+', " ", x)))
# remove hashtags
crCorpus <- tm::tm_map(crCorpus, tm::content_transformer(function(x) gsub('#\\S+', " ", x)))

Remove Excess White Space

White space characters can include: spaces, line feeds, tabs, carriage returns, and form feeds.

Non-Sentiment Analysis Corpus

crCorpus_no_SA <- tm_map(crCorpus_no_SA, stripWhitespace)
inspect(crCorpus_no_SA[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 187
## 
## just looking perfect lounge sleep cami lbs purchased dark purple small hangs bit loose preferred fit sleeping shirts good quality easy washing machine delicate cool airy hot summer nights

Sentiment Analysis Corpus

crCorpus <- tm_map(crCorpus, stripWhitespace)
inspect(crCorpus[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 252
## 
## just what i was looking for this is the perfect lounge sleep cami i am lbs and purchased the dark purple small its hangs a bit loose my preferred fit for sleeping shirts good quality easy washing machine delicate and cool and airy for hot summer nights
wordcloud(crCorpus_no_SA, 
          min.freq = 250,
          max.words = 100,
          random.order = FALSE,
          random.color = FALSE,
          colors = brewer.pal(8, "Dark2"))

wordcloud(crCorpus, 
          min.freq = 10,
          max.words = 250,
          random.order = FALSE,
          random.color = FALSE,
          colors = brewer.pal(8, "Dark2"))
Non-SA and SA Wordclouds


Stemming & Lemmatization

For non-SA, we can further compress the vocabulary by using either stemming or lemmatization. Both methods aim to reduce the number of terms in the vocabulary based on inflections (number, tense, gender, etc.).

Why should we not perform stemming or lemmatization for sentiment analysis?

Stemming

The stem of a word is the word form before any inflections are added. Stemming removes a word’s suffix (ending), such as es, s, ing, ed, or y, based on a heuristic algorithm. After the suffix is removed, a term is reduced to its base, root or stem word. Common stemming algorithms include:

  • Lovins (1968)
  • Porter (1980)
  • Paice (1990, 1994)

The default stemming algorithm in the tm package in R is Porter’s, and it relies on the SnowballC package, which can apply stemming in 24 languages.

sapply(c("operates", "operating", "operation", "operational", "operator", "operators", "operative", "operatives"), stem_words)
##    operates   operating   operation operational    operator   operators 
##      "oper"      "oper"      "oper"      "oper"      "oper"      "oper" 
##   operative  operatives 
##      "oper"      "oper"

When applying stemming (and lemmatization) to a Corpus, it is recommended that a duplicate Corpus be created.

The stemDocument() function can be used to apply stemming to the Corpus.

crCorpus_stem <- tm_map(crCorpus_no_SA, stemDocument)
inspect(crCorpus_stem[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 163
## 
## just look perfect loung sleep cami lbs purchas dark purpl small hang bit loos prefer fit sleep shirt good qualiti easi wash machin delic cool airi hot summer night

Lemmatization

A lemma is the basic word form. Lemmatization is the process of reducing a word to its base form while incorporating information about the word’s part of speech (POS) through morphological analysis.

sapply(c("operates", "operating", "operation", "operational", "operator", "operators", "operative", "operatives"), lemmatize_words)
##      operates     operating     operation   operational      operator 
##     "operate"     "operate"   "operation" "operational"    "operator" 
##     operators     operative    operatives 
##    "operator"   "operative"   "operative"

The lemmatize_strings() function from the textstem package can be used to apply lemmatization to the Corpus (using the lemmas in the lexicon package).

crCorpus_lem <- tm_map(crCorpus_no_SA, lemmatize_strings)
inspect(crCorpus_lem[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 171
## 
## just look perfect lounge sleep cami lb purchase dark purple small hang bite loose prefer fit sleep shirt good quality easy wash machine delicate cool airy hot summer night

Document-Term (or Term-Document) Representation

In traditional text analysis, text is represented using a Bag of Words (BOW) modeling assumption, in which term occurrence, rather than order and structure, is modeled.

The TermDocumentMatrix() and DocumentTermMatrix() functions can be used to create Term-Document (TDM) and Document-Term (DTM) Matrices, respectively. These functions create a simple triplet matrix using the slam package. The slam package has some convenience functions that are compatible with this type of object, such as: row_sums(), col_sums(), row_means() and col_means().

cr_DTM_lem <- DocumentTermMatrix(crCorpus_lem)
cr_DTM_lem
## <<DocumentTermMatrix (documents: 4508, terms: 5305)>>
## Non-/sparse entries: 118956/23795984
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

The Terms() function can be used to view the Terms in a DTM or TDM. The nTerms() and nDocs() functions can be used to identify the number of terms and documents, respectively.
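For example (using only the functions named above and the slam helpers mentioned earlier; output is omitted here):

nDocs(cr_DTM_lem)    # number of documents (4508)
nTerms(cr_DTM_lem)   # number of terms (5305)
head(Terms(cr_DTM_lem))                                            # a few vocabulary terms
head(sort(slam::col_sums(cr_DTM_lem), decreasing = TRUE), n = 10)  # most frequent lemmas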

The default weighting used is term frequency (weighting = weightTf). Alternative weighting can be applied in order to:

  • better reflect term importance rather than raw frequency,
  • down-weight terms that occur in many or most documents, and
  • normalize for document length.

Since term frequency does not equate to term importance, alternate weighting should be used when you are not performing 1) lexicon-based matching, such as Sentiment Analysis, or 2) analysis that relies on co-occurrence, such as Topic Models.

tf-idf is a popular weighting approach that combines term frequency, a local weighting method, with inverse document frequency, a global weighting method. Inverse document frequency (idf) gives higher weight to rare terms and lower weight to frequent terms. tf-idf is high when a term occurs many times in a few documents and is low when a term occurs in all, most or many documents.
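One common formulation (the exact logarithm base and normalization details vary by implementation) is

\[
\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \log\!\left(\frac{N}{\mathrm{df}_t}\right),
\]

where \(\mathrm{tf}_{t,d}\) is the number of times term \(t\) occurs in document \(d\), \(N\) is the number of documents in the collection, and \(\mathrm{df}_t\) is the number of documents that contain \(t\).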

The weighting can either be applied when creating the DTM, using the weighting argument, or assigned afterwards using the weightTfIdf() function. The default behavior of weightTfIdf() is normalize = TRUE, which divides each term frequency by the number of terms in the document, thereby normalizing by document length (the number of terms in each document).
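If document-length normalization is not wanted, the normalize argument can be turned off; a sketch of the alternative call (not used in the rest of this analysis):

# Unnormalized tf-idf weighting (alternative shown for illustration only)
DTM_lem_tfidf_raw <- weightTfIdf(cr_DTM_lem, normalize = FALSE)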

To avoid dividing by 0 when normalizing, empty documents need to be removed. Empty documents are those documents that do not contain any terms after preprocessing.

Note: In addition to checking before weighting is applied, you should also check for empty documents after dimension reduction and/or feature selection methods are applied (and before any analysis is conducted). Empty documents should be removed from both the DTM and the data.

To check for empty documents in the DTM:

cr_DTM_lem[apply(cr_DTM_lem, 1, sum) == 0,]
## <<DocumentTermMatrix (documents: 0, terms: 5305)>>
## Non-/sparse entries: 0/0
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

If empty documents are found, they can be removed using:

cr_DTM_lem <- cr_DTM_lem[apply(cr_DTM_lem, 1, sum) > 0,]

Applying tf-idf with normalization to the lemmatized DTM:

DTM_lem_tfidf <- weightTfIdf(cr_DTM_lem)
DTM_lem_tfidf
## <<DocumentTermMatrix (documents: 4508, terms: 5305)>>
## Non-/sparse entries: 118956/23795984
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Unsupervised Dimension Reduction & Feature Selection

There are 5305 terms and 4508 documents in our lemmatized, tf-idf weighted DTM.

Unsupervised dimension reduction/feature selection methods for text data include sparsity reduction and document frequency thresholding.

Sparsity Reduction: DTMs and TDMs are inherently sparse matrices. The average term sparsity in our DTM is 0.9958. We can use the removeSparseTerms() function on a DTM or TDM object to eliminate terms that are more sparse than a provided sparsity threshold (sparse).

DTM_l_tfidf_sr <- removeSparseTerms(DTM_lem_tfidf, sparse = .99)
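How many terms survive depends on the chosen threshold; a quick before-and-after comparison (results will depend on the data, so no output is shown here):

# Vocabulary size before and after sparsity reduction
c(before = nTerms(DTM_lem_tfidf), after = nTerms(DTM_l_tfidf_sr))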

Document frequency thresholding: We can choose to keep terms based on how many documents the terms occur in. Choosing a cutoff threshold, \(\alpha\), we can eliminate terms that do not meet the threshold.

doc_freq <- sort(slam::col_sums(DTM_lem_tfidf != 0), decreasing = TRUE)
head(doc_freq)
##  love   fit  wear  size  look dress 
##  1879  1824  1584  1557  1475  1373
# threshold chosen to be top 5%
terms_red <- doc_freq[1:ceiling(length(doc_freq)*0.05)]
DTM_red_df <- DTM_lem_tfidf[,DTM_lem_tfidf$dimnames$Terms %in% names(terms_red)]
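Following the earlier note, after either reduction step it is worth re-checking for documents that are now empty, using the same pattern as before (a sketch; the removal line mirrors the earlier one):

# Re-check for documents left empty by feature selection, and drop any that are found
sum(apply(DTM_red_df, 1, sum) == 0)
DTM_red_df <- DTM_red_df[apply(DTM_red_df, 1, sum) > 0,]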