The aim of this report is to describe the three text files that will be used to build the corpus for a predictive text algorithm for SwiftKey. This report does the following:
* Downloads the English text data sets provided from news feeds, blogs, and Twitter.
* Processes the data and creates a sample set to analyze.
* Runs some summary statistics on the full English data set.
* Creates a text corpus that natural language processing code can use to mine the text data.
* Runs some basic exploratory plots using word count and word pairing frequencies.
* Discusses next steps.
02 Data Processing
The first step is to configure R with the necessary software packages and then load in our text data.
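A minimal sketch of the package setup is shown below; the exact list is an assumption inferred from the functions used throughout this report.
# Packages used in this report (list inferred from the functions called below):
library(R.utils)       # countLines()
library(tm)            # Corpus(), tm_map(), and the text transformations
library(RWeka)         # NGramTokenizer() for building N-grams
library(dplyr)         # filter() and the %>% pipe
library(ggplot2)       # bar plots of N-gram frequencies
library(wordcloud)     # word cloud plots
library(RColorBrewer)  # brewer.pal() colour palettes
library(SnowballC)     # used by stemDocument()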
Loading the data:
# Load in the English versions of our text files:
englishBlogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
englishNews <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul=TRUE)
englishTwitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)
# Create an aggregated sample of all of our text data:
SAMPLE_SIZE <- 10000
sampleTwitter <- englishTwitter[sample(seq_along(englishTwitter), SAMPLE_SIZE)]
sampleNews <- englishNews[sample(seq_along(englishNews), SAMPLE_SIZE)]
sampleBlogs <- englishBlogs[sample(seq_along(englishBlogs), SAMPLE_SIZE)]
textSample <- c(sampleTwitter, sampleNews, sampleBlogs)
# Write the aggregated sample to a text file (creating the directory if needed):
if (!dir.exists("sample")) dir.create("sample")
writeLines(textSample, "sample/textSample.txt")
theSampleCon <- file("sample/textSample.txt")
theSample <- readLines(theSampleCon)
close(theSampleCon)
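As a quick sanity check on the aggregated sample (a small addition, not in the original report), we can confirm that the file contains three samples of 10,000 lines each:
# The aggregated sample should contain 3 * SAMPLE_SIZE = 30000 lines:
length(theSample)
head(theSample, 2)  # peek at a couple of sampled lines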
03 Basic Summary Statistics About the Data Sets:
# File Sizes (in megabytes):
englishTwitterSize <- round(file.info("final/en_US/en_US.twitter.txt")$size / (1024*1024),0)
englishNewsSize <- round(file.info("final/en_US/en_US.news.txt")$size / (1024*1024),0)
englishBlogsSize <- round(file.info("final/en_US/en_US.blogs.txt")$size / (1024*1024),0)
englishSampleFileSize <- round(file.info("sample/textSample.txt")$size / (1024*1024),0)
# Line Counts:
numEnglishTwitterLines <- countLines("final/en_US/en_US.twitter.txt")[1]
numEnglishNewsLines <- countLines("final/en_US/en_US.news.txt")[1]
numEnglishBlogsLines <- countLines("final/en_US/en_US.blogs.txt")[1]
numEnglishSampleLines <- countLines("sample/textSample.txt")[1]
# Word Counts (via the Unix wc utility):
numWordsEnglishTwitter <- as.numeric(system2("wc", args = "-w < final/en_US/en_US.twitter.txt", stdout=TRUE))
numWordsEnglishNews <- as.numeric(system2("wc", args = "-w < final/en_US/en_US.news.txt", stdout=TRUE))
numWordsEnglishBlog <- as.numeric(system2("wc", args = "-w < final/en_US/en_US.blogs.txt", stdout=TRUE))
numWordsEnglishSample <- as.numeric(system2("wc", args = "-w < sample/textSample.txt", stdout=TRUE))
# Assemble the summary statistics into a data frame:
fileSummary <- data.frame(
  fileName  = c("Blogs", "News", "Twitter", "Aggregated Sample"),
  fileSize  = c(englishBlogsSize, englishNewsSize, englishTwitterSize, englishSampleFileSize),
  lineCount = c(numEnglishBlogsLines, numEnglishNewsLines, numEnglishTwitterLines, numEnglishSampleLines),
  wordCount = c(numWordsEnglishBlog, numWordsEnglishNews, numWordsEnglishTwitter, numWordsEnglishSample)
)
colnames(fileSummary) <- c("Name", "Size (MB)", "Num Lines", "Num Words")
fileSummary
##                  Name Size (MB) Num Lines Num Words
## 1             Blogs       200    899288  37334690
## 2              News       196   1010242  34372720
## 3           Twitter       159   2360148  30374206
## 4 Aggregated Sample         5     30000    880463
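Note that the word counts above shell out to the Unix wc utility, so they assume a Unix-like system. A pure-R alternative could look like the sketch below (slower on files of this size, and its counts may differ slightly from wc -w):
# Hypothetical pure-R word count: split each line on whitespace and count tokens.
countWordsR <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sum(vapply(strsplit(lines, "\\s+"), function(tokens) sum(nzchar(tokens)), integer(1)))
}
# Example: countWordsR("sample/textSample.txt")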
04 Create and Clean the Corpus:
From our sample text file we can create a text corpus, which gives our natural language processing code the structure it needs to conduct the word analysis:
# Setup The Text Mining Class:
cname <- file.path(".", "sample")
finalCorpus <- Corpus(DirSource(cname))
# Convert corpus to lowercase:
finalCorpus <- tm_map(finalCorpus, content_transformer(tolower))
# Replace slashes, @ signs, and pipes with spaces:
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
finalCorpus <- tm_map(finalCorpus, toSpace, "/|@|\\|")
# Remove punctuation:
finalCorpus <- tm_map(finalCorpus, removePunctuation)
# Remove numbers:
finalCorpus <- tm_map(finalCorpus, removeNumbers)
# Strip whitespace:
finalCorpus <- tm_map(finalCorpus, stripWhitespace)
# Initiate stemming:
finalCorpus <- tm_map(finalCorpus, stemDocument)
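A quick way to check that the transformations behaved as expected is to look at a few lines of the cleaned text (a small sketch; it assumes the corpus holds the single sample file created above):
# Peek at the first few cleaned lines of the sample document:
head(content(finalCorpus[[1]]), 3)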
05 Create Our ‘N-Grams’ for Exploratory Data Analysis:
Next we create the N-grams (unigrams, bigrams, and trigrams) used for exploratory analysis:
# Create a unigram:
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unigram <- DocumentTermMatrix(finalCorpus, control = list(tokenize = unigramTokenizer))
unigramFreq <- sort(colSums(as.matrix(unigram)), decreasing=TRUE)
unigramWordFreq <- data.frame(word=names(unigramFreq), freq=unigramFreq)
# Create a bigram:
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram <- DocumentTermMatrix(finalCorpus, control = list(tokenize = bigramTokenizer))
bigramFreq <- sort(colSums(as.matrix(bigram)), decreasing=TRUE)
bigramWordFreq <- data.frame(word=names(bigramFreq), freq=bigramFreq)
# Create a trigram:
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram <- DocumentTermMatrix(finalCorpus, control = list(tokenize = trigramTokenizer))
trigramFreq <- sort(colSums(as.matrix(trigram)), decreasing=TRUE)
trigramWordFreq <- data.frame(word=names(trigramFreq), freq=trigramFreq)
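Before plotting, a quick look at the most frequent terms in each table serves as a sanity check (a small addition, not part of the original report):
# Most frequent terms in each N-gram table:
head(unigramWordFreq, 5)   # single words
head(bigramWordFreq, 5)    # word pairs
head(trigramWordFreq, 5)   # three-word sequences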
06 Exploratory Plots:
Unigrams: how often individual words appear in the text corpus.
# First Create A Plot of Our Unigrams:
unigramWordFreq %>% filter(freq > 1000) %>% ggplot(aes(word,freq)) +
geom_bar(stat="identity", colour="#37006b", fill="#a257e9") +
ggtitle("Unigrams With Frequencies Greater Than 1000") +
xlab("Unigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.text=element_text(size=14), axis.title=element_text(size=14,face="bold")) +
theme(plot.title = element_text(lineheight=1.8, face="bold", vjust=3))

Bigrams: sequences of two words.
# Plot bigrams:
bigramWordFreq %>% filter(freq > 100) %>% ggplot(aes(word,freq)) +
geom_bar(stat="identity", colour="#990068", fill="#cf6aaf") +
ggtitle("Bigrams With Frequencies Greater Than 100") +
xlab("Unigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.text=element_text(size=14), axis.title=element_text(size=14,face="bold")) +
theme(plot.title = element_text(lineheight=1.8, face="bold", vjust=3))

Trigrams: sequences of three words.
# Plot trigrams:
trigramWordFreq %>% filter(freq > 10) %>% ggplot(aes(word,freq)) +
geom_bar(stat="identity", colour="#00470d", fill="#4ebc63") +
ggtitle("Trigrams With Frequencies Greater Than 10") +
xlab("Trigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.text=element_text(size=14), axis.title=element_text(size=14,face="bold")) +
theme(plot.title = element_text(lineheight=1.8, face="bold", vjust=3))

07 Exploratory Plots 2:
Word cloud plots of our n-grams.
Unigram Word Cloud Plot:
set.seed(1991)
wordcloud(names(unigramFreq), unigramFreq, max.words=50, scale=c(5, .1), colors=brewer.pal(6, "Paired"))

Bigram Word Cloud Plot:
wordcloud(names(bigramFreq), bigramFreq, max.words=50, scale=c(5, .1), colors=brewer.pal(6, "Set1"))

Trigram Word Cloud Plot:
wordcloud(names(trigramFreq), trigramFreq, max.words=50, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))

08 Conclusions and Next Steps:
The next step is a prediction application. I will need to create a prediction algorithm and ensure that it runs quickly enough for acceptable use as a web product. I may also need to find a better approach than N-gram tokenization alone to predict the next word in a sequence. The deployed Shiny app should ideally satisfy all of the algorithm requirements.
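As a rough illustration of the direction, the sketch below uses the frequency tables built above to guess the next word from the last two words typed. The function name is hypothetical and the approach is deliberately naive; a real algorithm would need backoff smoothing and far better performance.
# Hypothetical next-word lookup: prefer the most frequent matching trigram,
# then fall back to bigrams, then to the most frequent word overall.
predictNextWord <- function(w1, w2) {
  tri <- trigramWordFreq[grepl(paste0("^", w1, " ", w2, " "), trigramWordFreq$word), ]
  if (nrow(tri) > 0) return(sub(".* ", "", as.character(tri$word[1])))
  bi <- bigramWordFreq[grepl(paste0("^", w2, " "), bigramWordFreq$word), ]
  if (nrow(bi) > 0) return(sub(".* ", "", as.character(bi$word[1])))
  as.character(unigramWordFreq$word[1])
}
# Example (output depends on the sample drawn): predictNextWord("one", "of")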