Overview

This analysis is for the Coursera Data Science Specialization Capstone class. The purpose of this report is to perform exploratory data analysis on text data and to become comfortable with this unique type of data. The data come from a corpus called HC Corpora (www.corpora.heliohost.org), and a zip file containing the text data used in this analysis is available for download. The corpus consists of three English text files: blogs, news stories, and tweets.


Data Processing

After downloading and extracting the zip file containing the three text documents, load the corpus into R using the tm package.

library(tm)

corpus_US <- file.path(".", "final", "en_US")
docs <- Corpus(DirSource(corpus_US))

dir(corpus_US)
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Here is a view of several lines of text from each of the three documents in the corpus.
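A minimal sketch of how such lines can be pulled from the loaded corpus is shown below; it assumes the documents load in the order listed by dir(), and the indices are illustrative rather than the exact lines excerpted.

## Peek at a few lines from each document (indices are illustrative)
head(docs[[1]]$content, 4)   # blogs
head(docs[[2]]$content, 5)   # news
head(docs[[3]]$content, 4)   # twitter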

Blogs

## [1] "As adults we ask – and answer – questions and unconsciously try to interpret the background nuances and circumstances and expect others to do the same."                                                                                                                                                                                
## [2] "A couple of months ago I noticed one of the plants in my bedroom window had a little green friend growing next to it. Being of the pacifist persuasion, I let it be."                                                                                                                                                                   
## [3] "Writer Beware has learned that Pearson Education, a major education services company (and the parent company of trade publisher Penguin), is currently requesting vastly extended licenses for copyrighted text and images that it has received permission from rightsholders to include in its print textbooks and other publications."
## [4] "Also this weekend: the first grilling of the season at my mom & dad’s house, the end of one soccer season, and celebratory beers at Three Aces."

News

## [1] "14915 Charlevoix, Detroit"                                                                                                                                                                                                                                                               
## [2] "\"It’s just another in a long line of failed attempts to subsidize Atlantic City,\" said Americans for Prosperity New Jersey Director Steve Lonegan, a conservative who lost to Christie in the 2009 GOP primary. \"The Revel Casino hit the jackpot here at government expense.\""      
## [3] "But time and again in the report, Sullivan called on CPS to correct problems to improve employee accountability, saying, for example, that measures to keep employees from submitting fraudulent invoices or to block employees from accessing inappropriate websites were not in place."
## [4] "\u0093I was just trying to hit it hard someplace,\u0094 said Rizzo, who hit the pitch to the opposite field in left-center. \u0093I\u0092m just up there trying to make good contact.\u0094"                                                                                             
## [5] "MHTA President and CEO Margaret Anderson Kelliher said construction would likely begin soon on a suite of offices on the building's fourth floor near the historic trading floor."

Twitter

## [1] "Cyberdating in China: Woman w/ duck's egg face seeks handsome devil, not from Wuhan, no Virgos. Illuminating piece on \"#romance by"
## [2] "Does anyone else remember when the best place to watch movie trailers was apple.com?"                                               
## [3] "You know me all to well."                                                                                                           
## [4] "please follow Artie so happy to see you again xo"

Below is a table summarizing the unaltered corpus.
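The table is produced with the summary.corpus helper defined in the Appendix and rendered with knitr::kable.

corpusTable <- summary.corpus(docs)
kable(corpusTable, align = "c", format.args = list(big.mark = ','))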

                       Lines        Words    Characters   Longest Line (chrs)       Size
en_US.blogs.txt      899,288   37,334,131   206,824,505                40,833   248.5 Mb
en_US.news.txt     1,010,242   34,372,530   203,223,159                11,384   249.6 Mb
en_US.twitter.txt  2,360,148   30,373,543   162,096,031                   140   301.4 Mb

During initial analysis of the full corpus, my Mac ran out of memory while processing the data. Thus, it will be necessary to take a small random sample of the corpus to bring it down to a manageable size for my computer.


Data Transformation

I have chosen to take a random sample of 1% of the lines from the three combined text documents in the corpus. I believe this will be a better representation of the English language than any single document alone, since the corpus includes professional news stories, blog posts, and tweets from Twitter. Ultimately I believe it will lead to better n-gram predictions because the lines cover a broad range of the English vernacular.
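The sample is drawn with the sample.corpus function defined in the Appendix, with a fixed seed for reproducibility.

## Draw a 1% random sample of lines from the corpus (seed = 700)
sample.docs <- sample.corpus(docs, 0.01, seed = 700)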

Below is a table summarizing the sample corpus.

              Lines       Words   Characters   Longest Line (chrs)       Size
Sample Doc   42,698   1,019,281    5,713,219                 2,620   159.2 Mb

The next step will be to transform the text of the sample corpus. In its raw state, the corpus contains many elements that pollute the text and hinder accurate analysis. The transformations applied to the corpus (see the transform.corpus function in the Appendix) are:

  1. Convert all text to lower case.
  2. Remove numbers.
  3. Remove punctuation.
  4. Remove English stop words.
  5. Strip extra whitespace.
  6. Trim leading and trailing whitespace.
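These transformations are applied with the transform.corpus function shown in the Appendix.

## Apply all of the transformations above to the sample corpus
sDocs <- transform.corpus(sample.docs)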

Below is an example of a line of text before any transformations.

## [1] "Breaking news!! Newt Leads mittens in all major polls 4 wives to 1.. and mittens rebuttles, all 4 of his wives were from 1 mariage.. lol"

Here is the same line after all the transformations.

## [1] "breaking news newt leads mittens major polls wives mittens rebuttles wives mariage lol"

I chose not to stem the corpus. (Stemming uses an algorithm that removes common endings from English words, such as “es”, “ed” and “s”.) In my opinion it did not always function properly, and I believe the full, unstemmed word forms in the corpus will provide valuable information for my future n-gram model.
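For reference, stemming a tm corpus would look like the sketch below; it is intentionally not applied in this analysis, and stemDocument relies on the SnowballC package.

library(SnowballC)

## Stemming example -- NOT applied to sDocs in this analysis
stemmedDocs <- tm_map(sDocs, stemDocument)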


Exploratory Data Analysis

Document Term Matrix

A document term matrix is a matrix with documents as the rows, terms as the columns, and the frequency of each term in each document as the cells.

dtm <- DocumentTermMatrix(sDocs)

Term Frequencies

The top 10 words and their counts.
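These counts can be obtained by summing the columns of the document term matrix; below is a minimal sketch, using the variable name freq that also appears in the Appendix code.

## Sum term counts across documents and sort, most frequent first
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 10)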

## will just said  one like  can  get time  new dont 
## 3220 3064 3004 2921 2734 2372 2234 2123 1856 1765

The number of words that appear once through fifteen times. (As in “there are 31,066 words that appear only once.”)
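A sketch of how this tally can be computed, assuming freq as defined above:

## Count how many distinct words occur exactly 1, 2, ..., 15 times
table(freq)[1:15]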

## freq
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 31066  7525  3708  2251  1584  1099   881   713   636   470   453   361 
##    13    14    15 
##   311   292   256

Most Frequent Terms

##      1      2      3      4      5      6      7      8      9      10    
## word "will" "just" "said" "one"  "like" "can"  "get"  "time" "new"  "dont"
## freq "3220" "3064" "3004" "2921" "2734" "2372" "2234" "2123" "1856" "1765"
##      11     12     13     14     15      
## word "now"  "good" "know" "day"  "people"
## freq "1723" "1684" "1681" "1643" "1603"

Letter Analysis

Reduce the corpus to words with fewer than 20 letters (to remove erroneous words).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   6.000   7.000   7.671   9.000  19.000
## 
##    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17 
## 2191 4131 6259 8178 8658 7746 6310 4729 3079 1986 1168  785  386  290  202 
##   18   19 
##  141   90

Letter Frequency

Word Cloud


Application Design

The basic methodology for the n-gram text prediction is as follows (a sketch of steps 3 and 4 appears after the list):

  1. Generate 1-gram, bigram, and trigram matrices.
  2. By summing frequency counts, generate a two-column table of unique n-grams and their frequencies (the “N-gram Frequency Table”).
  3. Match an n-word input string against the first n words of the (n+1)-gram entries in the N-gram Frequency Table.
  4. If there are matches, propose the highest-frequency completions to the user; the proposed next word is the last word of each matching (n+1)-gram.
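The following is only a minimal sketch of steps 3 and 4, not the application code itself; predict_next and the toy trigram table are hypothetical, and the real N-gram Frequency Table would be built from the corpus.

library(dplyr)
library(stringr)

## Hypothetical helper: given an n-word input, look up matching
## (n+1)-grams in a frequency table with columns `ngram` and `freq`,
## and return the last word of the highest-frequency matches.
predict_next <- function(input, ngram_freq, n_suggestions = 3) {
    input <- str_trim(tolower(input))
    matches <- ngram_freq %>%
        filter(str_detect(ngram, paste0("^", input, " "))) %>%
        arrange(desc(freq)) %>%
        head(n_suggestions)
    ## The proposed next word is the last word of each matching n-gram
    word(matches$ngram, -1)
}

## Toy trigram frequency table, for illustration only
tri <- data.frame(ngram = c("one of the", "one of my", "one of those"),
                  freq  = c(120, 85, 60),
                  stringsAsFactors = FALSE)
predict_next("one of", tri)
## [1] "the"   "my"    "those"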

Appendix

Corpus Statistics Function

summary.corpus <- function(crps, docs = TRUE) {
    lns <- c()
    chrs <- c()
    wrds <- c()
    cll <- c()
    sz <- c()
    nms <- c()
    if (docs) {
        ## If corpus is in Text document format
        for (i in 1:length(crps)) {
            lns[i] <- length(crps[[i]]$content)
            chrs[i] <- sum(nchar(crps[[i]]$content))
            wrds[i] <- length(unlist(strsplit(crps[[i]]$content, " ")))
            cll[i] <- max(nchar(crps[[i]]$content))
            sz[i] <- format(object.size(crps[[i]]$content), units = "Mb")
            nms[i] <- crps[[i]]$meta$id
        }
        sum.crps <- data.frame(lns, wrds, chrs, cll, sz, row.names = nms)
        names(sum.crps) <- c("Lines", "Words", "Characters", "Longest Line (chrs)", "Size")
        return(sum.crps)
    } else {
        ## If corpus is in line format (no documents, just lines)
        x <- sapply(crps, nchar)[1,]
        lns <- length(x)
        chrs <- sum(x)
        wrds <- sum(sapply(crps, function(x) { length(unlist(strsplit(as.character(x), " "))) }) )
        cll <- max(x)
        sz <- format(object.size(crps), units = "Mb")
        sum.crps <- data.frame(lns, wrds, chrs, cll, sz, row.names = "Sample Doc")
        names(sum.crps) <- c("Lines", "Words", "Characters", "Longest Line (chrs)", "Size")
        return(sum.crps)
    }
}
library(knitr)

corpusTable <- summary.corpus(docs)
kable(corpusTable, align = "c", format.args = list(big.mark = ','))

sampleTable <- summary.corpus(sample.docs, docs = FALSE)
kable(sampleTable, align = "c", format.args = list(big.mark = ','))

Sample Corpus Function

sample.corpus <- function(crps, size, seed = 1) {
    ## Function to return only a sample of the corpus data
    ## The purpose of sampling the corpus is to reduce computation time
    require(tm)
    set.seed(seed)
    v <- character(0)
    for (i in 1:length(crps)) {
        ## Draw `size` proportion of the lines from each document
        v <- c(v, sample(crps[[i]]$content, length(crps[[i]]$content) * size))
    }
    ## Return the sampled lines as a new corpus
    Corpus(VectorSource(v))
}
sample.docs <- sample.corpus(docs, 0.01, 700)

Corpus Transformation Function

transform.corpus <- function(crps) {
    require(tm)
    # Takes a corpus as input and returns it with desired transformations
    # Convert to lower case
    crps <- tm_map(crps, content_transformer(tolower))
    # Remove numbers
    crps <- tm_map(crps, removeNumbers)
    # Remove punctuation
    crps <- tm_map(crps, removePunctuation)
    # Remove english stop words
    crps <- tm_map(crps, removeWords, stopwords("english"))
    # Remove whitespace
    crps <- tm_map(crps, stripWhitespace)
    # Trim leading and trailing whitespace
    trim <- function(x) {
        gsub("^\\s+|\\s+$", "", x)
    }
    crps <- tm_map(crps, content_transformer(trim))
    # Return corpus
    crps
}
sDocs <- transform.corpus(sample.docs)

Most Frequent Terms

library(dplyr)
library(ggplot2)

## `freq` is the named vector of term counts derived from the
## document term matrix (see the Document Term Matrix section)
wf <- data.frame(word = names(freq), freq = freq) %>% arrange(desc(freq))
t(wf[1:15, ])
# Words with over 1,000 occurrences
subset(wf, freq > 1000) %>%
    ggplot(aes(x = reorder(word, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) +
    xlab("Top Words") + ylab("Frequency") +
    ggtitle("Words with over 1,000 Occurrences")

Letter Analysis

## Reduce the data frame to words with fewer than 20 letters
char.wf <- wf[ nchar(as.character(wf[,1])) < 20,]
summary(nchar(as.character(char.wf[,1])))
char.wf$word <- as.character(char.wf$word)
table(nchar(char.wf[,1]))

data.frame(nletters = nchar(char.wf[,1])) %>%
    ggplot(aes(x = nletters)) +
    geom_histogram(binwidth = 1) +
    geom_vline(xintercept = mean(nchar(char.wf[,1])), color = "blue", size = 1.5, alpha = 0.5) +
    labs(x = "Number of Letters", y = "Number of Words") +
    ggtitle("Word Length Plot")

Letter Frequency

library(stringr)
library(qdap)
lettrs <- wf[,1] %>% str_split("") %>% unlist %>% dist_tab
# Found a bunch of non-English letters; remove all letters that have freq < 700
lettrs <- lettrs[lettrs$freq > 700, ]

ggplot(lettrs, aes(x = reorder(toupper(interval), percent), y = percent)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(x = "Letter", y = "Percent") +
    scale_y_continuous(breaks=seq(0, 12, 2),
                       label=function(x) paste0(x, "%"),
                       expand=c(0,0), limits=c(0,12)) +
    ggtitle("Letter Frequency")

Word Cloud

library(wordcloud)
library(RColorBrewer)
set.seed(123)
wordcloud(words = wf[,1], freq = wf[,2], max.words = 140, 
          scale = c(5, 0.5), rot.per = 0.15, 
          colors = brewer.pal(6, "Dark2"))