The goal of this Capstone class is to apply data science to natural language processing and to understand and build predictive text models like those used by SwiftKey.
The training data come from a corpus called HC Corpora (www.corpora.heliohost.org); see the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the available corpora. The data were downloaded from the Coursera website.
This report analyzes a large corpus of text documents to discover structure in the data and how words are put together. I also describe my exploratory analysis and my goals for the eventual app and algorithm. The exploratory analysis begins with loading the needed packages.
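Based on the functions used later in this report, the packages below should cover what is needed (this exact list is my inference from those calls: tm for the corpus tools, RWeka for the n-gram tokenizer, qdap for replace_contraction, and wordcloud with RColorBrewer for the plots).
# Packages inferred from the functions used below; the exact set is an assumption
library(tm)            # VCorpus, tm_map, TermDocumentMatrix
library(RWeka)         # NGramTokenizer, Weka_control
library(qdap)          # replace_contraction
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal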
Once the zip files are downloaded and extracted, I can read them in as follows:
setwd("~/Desktop/Capstone/final/en_US")
# Read each source as a character vector of lines (skip embedded nulls, suppress EOL warnings)
Blog    <- readLines("en_US.blogs.txt",   encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
Twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
News    <- readLines("en_US.news.txt",    encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
Here is a brief characterization of the three text sources in the data set.
summary(Blog)
## Length Class Mode
## 899288 character character
summary(Twitter)
## Length Class Mode
## 2360148 character character
summary(News)
## Length Class Mode
## 1010242 character character
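The summaries above only report the number of lines in each source. As a rough extra check (assuming the stringi package is installed, which is not used elsewhere in this report), approximate word counts per source can be obtained like this:
# Approximate total word counts per source (stringi is an assumed extra dependency)
sum(stringi::stri_count_words(Blog))
sum(stringi::stri_count_words(Twitter))
sum(stringi::stri_count_words(News))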
head(Blog,3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
head(Twitter, 3)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
head(News, 3)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
These samples show that each source has its own style and formatting (shorthand and contractions in tweets, longer formal sentences in news, conversational blog posts), and handling each appropriately will guide my coding when I develop the app.
Because the data sets are so large, I will build a corpus from a subset of the data to demonstrate how the text can be cleaned and to show n-grams that should generalize to the full data set.
First, I select a subset of the data and place it into a corpus.
# Sample roughly 0.5% of each source (1/200 of the lines), without replacement
BlogChunk    <- sample(Blog,    NROW(Blog)/200,    replace = FALSE)
TwitterChunk <- sample(Twitter, NROW(Twitter)/200, replace = FALSE)
NewsChunk    <- sample(News,    NROW(News)/200,    replace = FALSE)
# Combine the three samples; each list element becomes one document in the corpus
CombineChunks <- list(BlogChunk, TwitterChunk, NewsChunk)
TextData <- VCorpus(VectorSource(CombineChunks))
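As a quick sanity check of my own, the corpus should now contain exactly three documents, one per sampled source, because each element of the list becomes one document:
# Expect 3: one document per sampled source
length(TextData)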
Next, I clean up extra spaces and normalize the formatting. Given the final goal of this project, I do not want to eliminate too many word types: in the predictive app, all word types (including common stopwords) should be available in the predicted phrases.
TextData <- tm_map(TextData, stripWhitespace)                           # collapse repeated whitespace
TextData <- tm_map(TextData, content_transformer(tolower))              # lowercase everything
TextData <- tm_map(TextData, content_transformer(replace_contraction))  # expand contractions (qdap)
# Replace slashes, @ signs, and pipes with spaces
toSpace  <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
TextData <- tm_map(TextData, toSpace, "/|@|\\|")
# No further conversion is needed: content_transformer() keeps each document a PlainTextDocument
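To confirm the cleaning worked as intended, I can peek at the first few lines of the cleaned blog document (a quick sketch; the exact inspection call is my own choice):
# First three cleaned lines of the blog sample
head(as.character(TextData[[1]]), 3)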
By creating an n-gram tokenizing function and word clouds, we can visually characterize the words and phrases that are most common.
nGramFun <- function(ng) {
  # RWeka's tokenizer can misbehave with parallel workers, so force a single core
  options(mc.cores = 1)
  NgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng))
  tdmFun <- TermDocumentMatrix(TextData, control = list(tokenize = NgramTokenizer))
  # Sum counts across the three documents to get one frequency per n-gram
  tdmFun <- as.data.frame(apply(tdmFun, 1, sum))
  colnames(tdmFun) <- c("Count")
  return(tdmFun)
}
# Build frequency tables for n-gram levels 1 through 4
tdm_1 <- nGramFun(1)
tdm_2 <- nGramFun(2)
tdm_3 <- nGramFun(3)
tdm_4 <- nGramFun(4)
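Before plotting, it helps to look at the most frequent entries directly. For example, the ten most frequent 2-grams in the sample (a quick sketch using the table built above):
# Ten most frequent 2-grams in the sampled corpus
head(tdm_2[order(tdm_2$Count, decreasing = TRUE), , drop = FALSE], 10)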
Word clouds can display the single words, 2-word phrases, 3-word phrases, and 4-word phrases that occur most frequently.
# Unigrams
wordcloud(words = row.names(tdm_1), freq = tdm_1[, 1], random.order = FALSE, min.freq = 1, max.words = 100, colors = brewer.pal(9, "Set1"), scale = c(5, .5), rot.per = .35)
# 2-grams
wordcloud(words = row.names(tdm_2), freq = tdm_2[, 1], random.order = FALSE, min.freq = 1, max.words = 50, colors = brewer.pal(9, "Set1"), scale = c(5, .5), rot.per = .35)
# 3-grams
wordcloud(words = row.names(tdm_3), freq = tdm_3[, 1], random.order = FALSE, min.freq = 1, max.words = 50, colors = brewer.pal(9, "Set1"), scale = c(5, .5), rot.per = .35)
# 4-grams
wordcloud(words = row.names(tdm_4), freq = tdm_4[, 1], random.order = FALSE, min.freq = 1, max.words = 50, colors = brewer.pal(9, "Set1"), scale = c(5, .5), rot.per = .35)
These exploratory results will guide my next steps in the Capstone. I will use the n-gram frequency tables to build a basic predictive model and put it into my Shiny app; a rough sketch of the idea follows.
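As a sketch of that model (a hypothetical helper, not the final algorithm), the trigram table built above can already drive a simple next-word lookup: split each 3-gram into a two-word prefix and a final word, then return the most frequent completions for a typed prefix.
# Illustrative sketch only: look up the most frequent completions of a two-word prefix
predictNext <- function(phrase, tdm3, n = 3) {
  parts  <- strsplit(row.names(tdm3), " ")
  prefix <- sapply(parts, function(p) paste(p[1:2], collapse = " "))
  lastw  <- sapply(parts, function(p) p[3])
  hits   <- which(prefix == tolower(phrase))
  if (length(hits) == 0) return(character(0))  # no match; a fuller model would back off to 2-grams
  top    <- order(tdm3$Count[hits], decreasing = TRUE)
  head(lastw[hits][top], n)
}
# Example usage: predictNext("thank you", tdm_3)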