This analysis is for the Coursera Data Science Specialization Capstone class. The purpose of this report is to perform exploratory data analysis on text data and to become comfortable with this unique type of data. The data is from a corpus called HC Corpora (www.corpora.heliohost.org). A zip file containing the text data used in this analysis can be downloaded by clicking here. The corpus consists of three English text files: blogs, news stories, and tweets.
After downloading and extracting the zip file containing the three text documents, load the corpus into R using the tm package.
library(tm)
corpus_US <- file.path(".", "final", "en_US")
docs <- Corpus(DirSource(corpus_US))
dir(corpus_US)
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
Here is a view of several lines of text from each of the three documents in the corpus: blogs first, then news, then tweets.
## [1] "As adults we ask – and answer – questions and unconsciously try to interpret the background nuances and circumstances and expect others to do the same."
## [2] "A couple of months ago I noticed one of the plants in my bedroom window had a little green friend growing next to it. Being of the pacifist persuasion, I let it be."
## [3] "Writer Beware has learned that Pearson Education, a major education services company (and the parent company of trade publisher Penguin), is currently requesting vastly extended licenses for copyrighted text and images that it has received permission from rightsholders to include in its print textbooks and other publications."
## [4] "Also this weekend: the first grilling of the season at my mom & dad’s house, the end of one soccer season, and celebratory beers at Three Aces."
## [1] "14915 Charlevoix, Detroit"
## [2] "\"It’s just another in a long line of failed attempts to subsidize Atlantic City,\" said Americans for Prosperity New Jersey Director Steve Lonegan, a conservative who lost to Christie in the 2009 GOP primary. \"The Revel Casino hit the jackpot here at government expense.\""
## [3] "But time and again in the report, Sullivan called on CPS to correct problems to improve employee accountability, saying, for example, that measures to keep employees from submitting fraudulent invoices or to block employees from accessing inappropriate websites were not in place."
## [4] "\u0093I was just trying to hit it hard someplace,\u0094 said Rizzo, who hit the pitch to the opposite field in left-center. \u0093I\u0092m just up there trying to make good contact.\u0094"
## [5] "MHTA President and CEO Margaret Anderson Kelliher said construction would likely begin soon on a suite of offices on the building's fourth floor near the historic trading floor."
## [1] "Cyberdating in China: Woman w/ duck's egg face seeks handsome devil, not from Wuhan, no Virgos. Illuminating piece on \"#romance by"
## [2] "Does anyone else remember when the best place to watch movie trailers was apple.com?"
## [3] "You know me all to well."
## [4] "please follow Artie so happy to see you again xo"
Below is a table summarizing the unaltered corpus.
| | Lines | Words | Characters | Longest Line (chrs) | Size |
|---|---|---|---|---|---|
| en_US.blogs.txt | 899,288 | 37,334,131 | 206,824,505 | 40,833 | 248.5 Mb |
| en_US.news.txt | 1,010,242 | 34,372,530 | 203,223,159 | 11,384 | 249.6 Mb |
| en_US.twitter.txt | 2,360,148 | 30,373,543 | 162,096,031 | 140 | 301.4 Mb |
While initially analyzing the full corpus, my Mac ran out of memory processing the data. Thus, it will be necessary to take a small random sample of the corpus to bring it down to a manageable size for my computer.
I have chosen to take a random sample of 1% of the lines from the combined three text documents. Sampling across all three documents should give a better representation of the English language than sampling from any single one, since the corpus includes professional news stories, blog posts, and tweets from Twitter. Ultimately I believe this will lead to better n-gram predictions because the sampled lines cover a broad range of the English vernacular.
Below is a table summarizing the sample corpus.
| | Lines | Words | Characters | Longest Line (chrs) | Size |
|---|---|---|---|---|---|
| Sample Doc | 42,698 | 1,019,281 | 5,713,219 | 2,620 | 159.2 Mb |
The next step will be to transform the text of the sample corpus. In its raw state, the corpus has a lot of elements that pollute the text and hinder accurate analysis. Transformations to the corpus (implemented in the transform.corpus function in the code below) include:

- converting all text to lower case
- removing numbers
- removing punctuation
- removing English stop words
- collapsing extra whitespace
- trimming leading and trailing whitespace
Below is an example of a line of text before any transformations.
## [1] "Breaking news!! Newt Leads mittens in all major polls 4 wives to 1.. and mittens rebuttles, all 4 of his wives were from 1 mariage.. lol"
Here is the same line after all the transformations.
## [1] "breaking news newt leads mittens major polls wives mittens rebuttles wives mariage lol"
I chose not to stem the corpus. (Stemming uses an algorithm that removes common word endings from English words, such as “es”, “ed” and “s”.) In my opinion it didn’t always function properly, and I believe the unstemmed word forms in the corpus will provide valuable information for my future n-gram model.
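For reference, this is roughly what stemming would have looked like using tm's stemDocument (a sketch only, since stemming was not applied in this analysis). Note how it can produce non-words such as "hous":

library(tm)
library(SnowballC)  # Porter stemmer used by stemDocument()
# Stem a few example words
stemDocument(c("running", "runs", "houses", "argued"))
# If stemming were desired, it could be applied to the whole corpus:
# sDocs <- tm_map(sDocs, stemDocument)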
A document term matrix is simply a matrix with documents as the rows, terms as the columns, and the count of each term in each document as the cells.
dtm <- DocumentTermMatrix(sDocs)
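One way to obtain the term frequencies summarized below is to sum each term's count across all documents. The report does not show this step, so the following is a sketch; col_sums from the slam package (which tm uses for its sparse matrices) avoids expanding the large matrix into dense form:

library(slam)
# Total count of each term across the sample corpus
freq <- col_sums(dtm)
# The ten most frequent terms
sort(freq, decreasing = TRUE)[1:10]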
The top 10 words and their counts.
## will just said one like can get time new dont
## 3220 3064 3004 2921 2734 2372 2234 2123 1856 1765
The number of words that appear once through fifteen times (as in “there are 31,066 words that appear only once”).
## freq
## 1 2 3 4 5 6 7 8 9 10 11 12
## 31066 7525 3708 2251 1584 1099 881 713 636 470 453 361
## 13 14 15
## 311 292 256
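This count-of-counts table can be produced directly from the freq vector, for example (a sketch):

# How many distinct words occur exactly 1, 2, ..., 15 times
table(freq)[1:15]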
The 15 most frequent words and their counts.
## 1 2 3 4 5 6 7 8 9 10
## word "will" "just" "said" "one" "like" "can" "get" "time" "new" "dont"
## freq "3220" "3064" "3004" "2921" "2734" "2372" "2234" "2123" "1856" "1765"
## 11 12 13 14 15
## word "now" "good" "know" "day" "people"
## freq "1723" "1684" "1681" "1643" "1603"
Reduce the word-frequency table to words with fewer than 20 letters (to remove erroneous “words”). Below is a summary of the remaining word lengths, followed by a count of words at each length.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 6.000 7.000 7.671 9.000 19.000
##
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 2191 4131 6259 8178 8658 7746 6310 4729 3079 1986 1168 785 386 290 202
## 18 19
## 141 90
The basic methodology for the n-gram text prediction will be to build n-gram frequency tables (bigrams, trigrams, and so on) from the cleaned sample corpus and use them to predict the most likely next word. A rough sketch of the idea is shown below.
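As an illustration only (not the final prediction model), here is a minimal base-R sketch of building a bigram frequency table from the transformed sample corpus and using it for a naive next-word lookup; the example word "right" is arbitrary:

# Collapse the transformed corpus back into a character vector of lines
lines <- sapply(sDocs, as.character)
# Split each line into words and paste adjacent word pairs into bigrams
bigrams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))
# Frequency table of bigrams, most common first
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 10)
# Naive prediction: the most frequent bigrams beginning with a given word
head(bigram_freq[grepl("^right ", names(bigram_freq))], 3)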
summary.corpus <- function(crps, docs = TRUE) {
  lns <- c()
  chrs <- c()
  wrds <- c()
  cll <- c()
  sz <- c()
  nms <- c()
  if (docs) {
    ## If corpus is in text document format
    for (i in 1:length(crps)) {
      lns[i] <- length(crps[[i]]$content)
      chrs[i] <- sum(nchar(crps[[i]]$content))
      wrds[i] <- length(unlist(strsplit(crps[[i]]$content, " ")))
      cll[i] <- max(nchar(crps[[i]]$content))
      sz[i] <- format(object.size(crps[[i]]$content), units = "Mb")
      nms[i] <- crps[[i]]$meta$id
    }
    sum.crps <- data.frame(lns, wrds, chrs, cll, sz, row.names = nms)
    names(sum.crps) <- c("Lines", "Words", "Characters", "Longest Line (chrs)", "Size")
    return(sum.crps)
  } else {
    ## If corpus is in line format (no documents, just lines)
    x <- sapply(crps, nchar)[1, ]
    lns <- length(x)
    chrs <- sum(x)
    wrds <- sum(sapply(crps, function(x) { length(unlist(strsplit(as.character(x), " "))) }))
    cll <- max(x)
    sz <- format(object.size(crps), units = "Mb")
    sum.crps <- data.frame(lns, wrds, chrs, cll, sz, row.names = "Sample Doc")
    names(sum.crps) <- c("Lines", "Words", "Characters", "Longest Line (chrs)", "Size")
    return(sum.crps)
  }
}
library(knitr)  # for kable()
corpusTable <- summary.corpus(docs)
kable(corpusTable, align = "c", format.args = list(big.mark = ','))
sampleTable <- summary.corpus(sample.docs, docs = FALSE)
kable(sampleTable, align = "c", format.args = list(big.mark = ','))
sample.corpus <- function(crps, size, seed = 1) {
  ## Return a random sample of lines from the corpus
  ## (sampling reduces the data to a size the computer can handle)
  require(tm)
  set.seed(seed)
  v <- character(0)  # start empty; sampled lines are appended below
  for (i in 1:length(crps)) {
    v <- c(v, sample(crps[[i]]$content, length(crps[[i]]$content) * size))
  }
  Corpus(VectorSource(v))
}
sample.docs <- sample.corpus(docs, 0.01, 700)
transform.corpus <- function(crps) {
  require(tm)
  ## Takes a corpus as input and returns it with the desired transformations
  # Convert to lower case
  crps <- tm_map(crps, content_transformer(tolower))
  # Remove numbers
  crps <- tm_map(crps, removeNumbers)
  # Remove punctuation
  crps <- tm_map(crps, removePunctuation)
  # Remove English stop words
  crps <- tm_map(crps, removeWords, stopwords("english"))
  # Collapse runs of whitespace into a single space
  crps <- tm_map(crps, stripWhitespace)
  # Trim leading and trailing whitespace
  trim <- function(x) {
    gsub("^\\s+|\\s+$", "", x)
  }
  crps <- tm_map(crps, content_transformer(trim))
  # Return the transformed corpus
  crps
}
sDocs <- transform.corpus(sample.docs)
# Spot-check a single transformed line (the before/after example shown earlier)
# sDocs[[41002]]$content
library(dplyr)
# Word/frequency data frame, most frequent terms first
wf <- data.frame(word = names(freq), freq = freq) %>% arrange(desc(freq))
t(wf[1:15, ])  # the 15 most frequent words, transposed for display
library(ggplot2)
# Words with over 1,000 occurrences
subset(wf, freq > 1000) %>%
  ggplot(aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) +
  xlab("Top Words") + ylab("Frequency") +
  ggtitle("Words with over 1,000 Occurrences")
## Reduce the word-frequency data frame to words with fewer than 20 letters
char.wf <- wf[nchar(as.character(wf[, 1])) < 20, ]
summary(nchar(as.character(char.wf[, 1])))
char.wf$word <- as.character(char.wf$word)
table(nchar(char.wf[, 1]))
data.frame(nletters = nchar(char.wf[, 1])) %>%
  ggplot(aes(x = nletters)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = mean(nchar(char.wf[, 1])), color = "blue", size = 1.5, alpha = 0.5) +
  labs(x = "Number of Letters", y = "Number of Words") +
  ggtitle("Word Length Plot")
library(stringr)
library(qdap)
# Split every word into single letters and tabulate letter frequencies
lettrs <- wf[, 1] %>% str_split("") %>% unlist %>% dist_tab
# Found a bunch of non-English characters; keep only letters with freq > 700
lettrs <- lettrs[lettrs$freq > 700, ]
ggplot(lettrs, aes(x = reorder(toupper(interval), percent), y = percent)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Letter", y = "Percent") +
  scale_y_continuous(breaks = seq(0, 12, 2),
                     labels = function(x) paste0(x, "%"),
                     expand = c(0, 0), limits = c(0, 12)) +
  ggtitle("Letter Frequency")
library(wordcloud)
set.seed(123)
wordcloud(words = wf[,1], freq = wf[,2], max.words = 140,
scale = c(5, 0.5), rot.per = 0.15,
colors = brewer.pal(6, "Dark2"))