The following report is an initial exploratory analysis for the Data Science Capstone project.
English text is analyzed from three main sources: blogs, news, and Twitter feeds.
The following is a basic summary of the three English files that we will be using for the analysis:
##         Num.of.Lines Size.in.mb
## Blog          899288   200.4242
## Twitter      2360148   159.3641
## News           77259   196.2775
##            Min. 1st Qu. Median      Mean 3rd Qu.  Max.
## blog_chars    1      47    157 231.69601     331 40835
## twit_chars    2      37     64  68.80281     100   213
## news_chars    2     111    186 203.00243     270  5760
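A minimal sketch of how such a summary could be computed is shown below; the file paths assume the standard en_US file layout for this project and are assumptions, not taken from the report:

blog_file    <- "final/en_US/en_US.blogs.txt"     # assumed paths
twitter_file <- "final/en_US/en_US.twitter.txt"
news_file    <- "final/en_US/en_US.news.txt"

blog    <- readLines(blog_file,    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_file, encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(news_file,    encoding = "UTF-8", skipNul = TRUE)

# Line counts and file sizes (in MB)
data.frame(
  Num.of.Lines = c(length(blog), length(twitter), length(news)),
  Size.in.mb   = c(file.size(blog_file), file.size(twitter_file), file.size(news_file)) / 1024^2,
  row.names    = c("Blog", "Twitter", "News")
)

# Per-line character counts (repeat for twitter and news)
summary(nchar(blog))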
For the purposes of our analysis, a small subset of the complete Blog, Twitter, and News files will be used to develop term frequency plots, wordclouds, and word coverage figures, due to hardware/computational limitations. The subset was randomly sampled from within the ‘Getting and Cleaning the Data.R’ script in this repository.
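The sampling itself lives in that script; the following is only a minimal sketch of the idea, where the seed and sampling fraction are illustrative assumptions rather than the script’s actual values:

set.seed(1234)       # illustrative seed for reproducibility
sample_frac <- 0.02  # illustrative sampling fraction
sample_text <- c(
  sample(blog,    round(length(blog)    * sample_frac)),
  sample(twitter, round(length(twitter) * sample_frac)),
  sample(news,    round(length(news)    * sample_frac))
)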
We will use the qdap package to develop term frequency data from our subset. As our text data is likely to contain many stop words, we will analyze three different sets of our data (a sketch of how these could be built follows the list):
- freq_allwords: All words, no stopwords removed
- freq_TOP100stopwords: The top 100 stopwords (from within the ‘qdap’ package) removed
- freq_TMstopwords: All stopwords (from the ‘tm’ package) removed
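A minimal sketch of how these three tables might be built with qdap::freq_terms(), assuming the sampled lines are held in the character vector sample_text from above (the top argument just needs to exceed the vocabulary size so every unique term is kept):

library(qdap)  # freq_terms(); also attaches qdapDictionaries, which holds Top100Words
library(tm)    # stopwords("english")

freq_allwords        <- freq_terms(sample_text, top = 100000)
freq_TOP100stopwords <- freq_terms(sample_text, top = 100000,
                                   stopwords = qdapDictionaries::Top100Words)
freq_TMstopwords     <- freq_terms(sample_text, top = 100000,
                                   stopwords = tm::stopwords("english"))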
We will also perform a coverage analysis to determine how many unique terms are needed to cover given percentages of the total corpus length. To do this, we develop a function, ‘wordcoverage’, which takes the terms (and their frequencies) as input, along with the coverage percentage we wish to analyze and the total word count of the corpus. In this case, with our cleaned data sitting in the ‘freq_allwords’ variable, we simply use the sum of the frequencies in that variable as the total word count.
library(ngram)

# Total number of word occurrences in the sampled corpus (sum of all term frequencies)
totalwordcount <- sum(freq_allwords$FREQ)

# Returns the number of top-frequency terms required to cover the proportion
# 'coverage' of 'totalwords' word occurrences.
wordcoverage <- function(terms, coverage, totalwords){
  if(sum(terms$FREQ) < totalwords * coverage){
    stop("The frequencies in the terms provided do not reach the requested coverage of the total word count.")
  }
  terms$CUMFREQ <- cumsum(terms$FREQ)
  # Index of the first term at which the cumulative frequency exceeds the threshold
  min(which(terms$CUMFREQ > totalwords * coverage))
}
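For example, applying the function to the full word set at a 50% coverage level returns the value reported in the coverage table further below:

wordcoverage(freq_allwords, 0.50, totalwordcount)   # 149 unique words needed, per the table below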
coverage_df <- data.frame(
  'Total Instances' = c(
    sum(freq_allwords$FREQ),
    sum(freq_TOP100stopwords$FREQ),
    sum(freq_TMstopwords$FREQ)
  ),
  'Total Word Count' = totalwordcount,
  'Coverage of total Word Count' = c(
    sum(freq_allwords$FREQ) / totalwordcount,
    sum(freq_TOP100stopwords$FREQ) / totalwordcount,
    sum(freq_TMstopwords$FREQ) / totalwordcount
  ),
  'Unique Terms' = c(
    nrow(freq_allwords),
    nrow(freq_TOP100stopwords),
    nrow(freq_TMstopwords)
  )
)
rownames(coverage_df) = c("All Words","Top 100 Stopwords Removed", "All tm Stopwords Removed")
coverage_df
##                           Total.Instances Total.Word.Count
## All Words                         1997407          1997407
## Top 100 Stopwords Removed         1582992          1997407
## All tm Stopwords Removed          1076628          1997407
##                           Coverage.of.total.Word.Count Unique.Terms
## All Words                                    1.0000000        48488
## Top 100 Stopwords Removed                    0.7925235        35936
## All tm Stopwords Removed                     0.5390128        35792
From the above, the total number of word occurrences in our subset corpus is 1,997,407. With the top 100 stopwords removed, the remaining terms cover only ~79% of that total (the top stopwords are, as expected, very high in frequency); the drop in coverage is even more pronounced when all tm stopwords are removed (third row above).
Now we can use our ‘wordcoverage’ function to determine the number of unique words we require from each set to cover ‘X’% of the total number of word occurrences (i.e., of 1,997,407):
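One way the table below could be produced (the row and column labels mirror the printed output; an NA is recorded where a set’s frequencies cannot reach the requested share of the full word count):

coverages <- c(0.25, 0.50, 0.75, 0.90)
term_sets <- list(freq_allwords, freq_TOP100stopwords, freq_TMstopwords)

# Count the unique terms required for each set at each coverage level
coverage_counts <- t(sapply(term_sets, function(terms) {
  sapply(coverages, function(p) {
    tryCatch(wordcoverage(terms, p, totalwordcount), error = function(e) NA)
  })
}))
coverage_counts <- as.data.frame(coverage_counts)
colnames(coverage_counts) <- paste0("X", format(coverages, nsmall = 2))
rownames(coverage_counts) <- c("All Words", "Top 100 Stopwords Removed",
                               "All tm Stopwords Removed")
coverage_counts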
##                           X0.25 X0.50 X0.75 X0.90
## All Words                    15   149  1616  6680
## Top 100 Stopwords Removed    61   948 11726    NA
## All tm Stopwords Removed    856 12405    NA    NA
From the table above, we see that only 149 unique words are required to cover 50% of the total word occurrences when we do NOT remove any stopwords (first row). However, when all stopwords from the ‘tm’ package are removed, the number of words needed to cover 50% of occurrences jumps to 12,405. This shows just how much of the corpus is accounted for by the most frequent stopwords. The NA entries arise because, once stopwords are removed, the remaining terms no longer add up to the higher coverage thresholds of the original word count (~79% of the total remains after removing the top 100 stopwords, and only ~54% after removing all tm stopwords), so those coverage levels cannot be reached.
To further explore, we can build wordclouds of the top terms in each set:
Since the top 2 words are stopwords that have such disproportionately high frequencies, we can simply remove them for this wordcloud to get a better ‘overall’ picture:
library(wordcloud)
wordcloud(freq_TMstopwords$WORD, freq_TMstopwords$FREQ,
          max.words = 50,
          colors = c("turquoise2", "darkgoldenrod1", "tomato"))
Moving forward, we will develop n-gram tokenizations of the terms and will likely keep some stopwords in the final model: since we are developing a predictive text app, we cannot remove all stopwords, as they are clearly an integral part of natural language. In addition, we will further explore ways of minimizing memory pressure, likely by writing the tokenizations to separate files and reading from them dynamically instead of storing everything in the workspace.
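As a rough sketch of that direction, assuming the sampled corpus is in sample_text and using the ‘ngram’ package loaded earlier (the output file name is illustrative):

library(ngram)

# Build a bigram frequency table from the sampled text
bigrams      <- ngram(concatenate(sample_text), n = 2)
bigram_freqs <- get.phrasetable(bigrams)
head(bigram_freqs)

# Write the table to disk so it can be read back on demand rather than held in memory
saveRDS(bigram_freqs, "bigram_freqs.rds")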
When building the model, we will explore various techniques for predicting text (e.g., back-off models), and we will aim to measure accuracy in standard machine learning fashion (i.e., inputting held-out ‘test’ fragments of n-gram-sized sentences and having our model predict the true upcoming words).
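Purely as an illustration of the back-off idea (not the final model), the sketch below assumes trigram and bigram frequency tables in the format returned by ngram::get.phrasetable, with ‘ngrams’ and ‘freq’ columns:

# Predict the next word for a two-word context, backing off from trigrams to bigrams
predict_next <- function(context, trigram_freqs, bigram_freqs) {
  words <- tail(strsplit(tolower(context), "\\s+")[[1]], 2)

  # Look for trigrams whose first two words match the context
  pattern <- paste0("^", paste(words, collapse = " "), " ")
  hits <- trigram_freqs[grepl(pattern, trigram_freqs$ngrams), ]

  # Back off to bigrams keyed on the last context word if nothing matches
  if (nrow(hits) == 0) {
    pattern <- paste0("^", tail(words, 1), " ")
    hits <- bigram_freqs[grepl(pattern, bigram_freqs$ngrams), ]
  }
  if (nrow(hits) == 0) return(NA_character_)

  # Return the final word of the most frequent matching n-gram
  best <- hits$ngrams[which.max(hits$freq)]
  tail(strsplit(trimws(best), "\\s+")[[1]], 1)
}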