The purpose of this document is to report the exploratory analysis of the unstructured text data from blogs, news, and Twitter provided by SwiftKey. Each source was provided as a separate text file. The ultimate goal of the project is to suggest the next word as the user types, based on the Markov assumption that the probability of the next word depends only on the previous word, using a digram language model.
Text data were ingested with the readLines function using latin1 encoding, which is sufficient for US English text. UTF-8 would be a better choice if many non-ASCII characters were expected, but that did not apply here.
# packages whose functions are called below without a namespace prefix
library(magrittr)   # %>% pipe used by the cleaning functions
library(ggplot2)    # bar charts and line plots
library(Hmisc)      # dotchart2
library(ngram)      # ngram() and babble() for the digram model

# load an entire text file as a character vector, one element per line
loadtext <- function(fileName) {
  con <- file(fileName, "r")
  tmp <- readLines(con, encoding = "latin1", warn = FALSE)
  close(con)
  tmp
}
word.stats<-list()
line.stats<-list()
blog<-loadtext("en_US.blogs.txt")
longest.line<-max(nchar(blog))
word.stats[["blog"]]<-ngram::wordcount(blog)
line.stats[["blog"]]<-NROW(blog)
news<-loadtext("en_US.news.txt")
word.stats[["news"]]<-ngram::wordcount(news)
line.stats[["news"]]<-NROW(news)
twitter<-loadtext("en_US.twitter.txt")
word.stats[["twitter"]]<-ngram::wordcount(twitter)
line.stats[["twitter"]]<-NROW(twitter)
Since the text files are large and would be challenging to process in full, a subset of each was sampled for further processing.
# load a 5% random sample of the lines in a text file
loadsampletext <- function(fileName) {
  con <- file(fileName, "r")
  tmp <- readLines(con, encoding = "latin1", warn = FALSE)
  close(con)
  set.seed(2018)
  # draw a uniform 5% sample of the lines, without replacement
  tmp[sample(length(tmp), size = round(length(tmp) * 0.05))]
}
blog<-loadsampletext("en_US.blogs.txt")
news<-loadsampletext("en_US.news.txt")
twitter<-loadsampletext("en_US.twitter.txt")
all<-c(blog, news, twitter)
Tweets often contain hashtags and user names, and all three sources may contain URLs and email addresses. URLs, email addresses, and user names were removed because they are not suitable for prediction.
options(max.print = 25)
# remove URLs, email addresses, and Twitter user names
remove.email.url <- function(input) {
  input %>%
    gsub(pattern = "\\b(ftp|https?)://\\S+|\\bwww\\.\\S+", replacement = "") %>%                   # URLs
    gsub(pattern = "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b", replacement = "") %>%  # email addresses
    gsub(pattern = "@[A-Za-z0-9_]+", replacement = "")                                             # Twitter user names
}
all<-remove.email.url(all)
All text was converted to lower case. The profanity list compiled by Parker and released on March 26, 2018 was used to remove profane words from the ingested text (Bad Words List, List of Swear Words, Google Banned Words List). Additionally, extra whitespace, numbers, stop words, non-ASCII characters, and punctuation were removed.
# bad words
badwords<-readLines("full-list-of-bad-words-text-file_2018_03_26.txt",warn=FALSE)
clean_txt <- function(input) {
  input %>%
    tm::stripWhitespace() %>%
    tolower() %>%
    tm::removeNumbers() %>%
    tm::removePunctuation() %>%
    tm::removeWords(badwords) %>%                    # profanity filtering
    tm::removeWords(tm::stopwords("english")) %>%    # English stop words
    iconv("latin1", "ASCII", sub = "") %>%           # drop non-ASCII characters
    tm::stripWhitespace()
}
all<-clean_txt(all)
The language model chosen is the n-gram, the simplest language model that assigns probabilities to sequences of words. The text data were tokenized into unigrams, digrams, and trigrams. The unigrams are used to explore the words in the data, while the digram and trigram models will be used to build the predictive models under the Markov assumption.
all.uniGrm<-stylo::parse.corpus(all, ngram.size = 1)
## slicing input text into tokens...
## turning words into features, e.g. char n-grams (if applicable)...
all.diGrm<-stylo::parse.corpus(all, ngram.size = 2)
## slicing input text into tokens...
## turning words into features, e.g. char n-grams (if applicable)...
all.triGrm<-stylo::parse.corpus(all, ngram.size =3)
## slicing input text into tokens...
## turning words into features, e.g. char n-grams (if applicable)...
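As a quick sanity check of the tokenization, the first few digrams of the first sampled line can be inspected. This assumes parse.corpus() returns a named list with one element of tokens per input line, as the indexing by "1" later in this report suggests.
# peek at the first few digram tokens of the first sampled line (assumed structure)
head(all.diGrm[["1"]])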
Exploratory analysis was conducted to characterize the text data.
#Word count and line count
dotchart2(unlist(word.stats), main="Fig 1: Word Count", xlab="Count")
dotchart2(unlist(line.stats), main="Fig 2: Line Count", xlab="Count")
word.c<-data.frame(word.stats)
line.c<-data.frame(line.stats)
w.per.l<-data.frame(blog=word.c$blog/line.c$blog,
news=word.c$news/line.c$news,
twitter=word.c$twitter/line.c$twitter)
w.per.l<-as.data.frame(t(w.per.l))
w.per.l$type<-row.names(w.per.l)
colnames(w.per.l)<-c("wpl","type")
ggplot(w.per.l, aes(type,wpl))+geom_col()+labs(title="Fig 3: Words per line", y="Words per line")
#Word Count
word.count <- function(input){
tmp<-data.frame(table(input))
tmp<-tmp[order(tmp$Freq, decreasing=T),]
tmp$input <- factor(tmp$input, levels = tmp$input[order(tmp$Freq)])
tmp
}
all.freq1<-word.count(all.uniGrm)
all.freq2<-word.count(all.diGrm)
all.freq3<-word.count(all.triGrm)
ggplot(all.freq1[1:20,], aes(input ,Freq))+geom_col()+coord_flip()+labs(title="Fig 4: blogs+news+twitter - unigram: word count - top 20")
ggplot(all.freq2[1:20,], aes(input, Freq))+geom_col()+coord_flip()+labs(title="Fig 5: blogs+news+twitter - digram: word count - top 20")
ggplot(all.freq3[1:20,], aes(input, Freq))+geom_col()+coord_flip()+labs(title="Fig 6: blogs+news+twitter - trigram: word count - top 20")
# cumulative proportion of all word occurrences covered by the top n unique words
top.n <- 9000
freqpct <- data.frame(
  pct = cumsum(all.freq1$Freq[1:top.n]) / sum(all.freq1$Freq),
  n   = seq_len(top.n)
)
ggplot(freqpct, aes(pct,n))+geom_line()+labs(title="Fig 7: Portion of all words by the top unique words", x="Proportion of all words (%)", y="Top most frequent unique words")+geom_vline(xintercept = c(0.5,0.9))+xlim(c(0,1)) +geom_hline(yintercept = c(759,7924),col="red")
As seen in the word counts (Figure 1) and line counts (Figure 2) for each source, the blog text contains the highest number of words (n = 37,334,131), while the Twitter text contains the highest number of lines (n = 2,360,148). Figure 3 shows the number of words per line for each type of text. Blogs tend to use more words per line (42) than news (34). As expected, Twitter uses the fewest words per line, given its 140-character limit. Figures 4, 5, and 6 show the most frequent unigrams, digrams, and trigrams, respectively, across all three types of text. As expected, unigram counts are higher than digram and trigram counts.
There were 21,165 unique words identified across the three text samples. Figure 7 shows the proportion of all word occurrences covered by the most frequent unique words. The top 759 (3.6%) most frequent words account for 50% of all word occurrences, and the top 7,924 (37%) account for 90%.
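The 50% and 90% cut-offs quoted above can also be located programmatically from the cumulative proportions in freqpct. The helper below, coverage.cutoff, is a hypothetical sketch added here for illustration only; it is not part of the original analysis.
# hypothetical helper: smallest number of top-ranked unique words whose
# cumulative frequency reaches a target proportion of all word occurrences
coverage.cutoff <- function(freqpct, target) {
  min(freqpct$n[freqpct$pct >= target])
}
coverage.cutoff(freqpct, 0.5)   # expected to be close to 759
coverage.cutoff(freqpct, 0.9)   # expected to be close to 7924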
In general, the same approach could be taken for other languages, but the process would require a language-specific set of stop words, a different stemming method, and a different way of splitting the text into unigrams, digrams, and trigrams. Using a dictionary might be another way to increase coverage.
As seen in Figure 7, which shows the proportion of all word occurrences covered by the most common unique words, 90% of all word occurrences are covered by the top 7,924 words. Therefore, two n-gram models will be built for model comparison.
The simplest n-gram model can be built using the Markov assumption that the next word depends only on the current word.
babble(ngram(all.diGrm[["1"]]), genlen=5, seed=2018)
## [1] "across town mention raised one "
Parker, James. “Full List of Bad Words and Swear Words Banned by Google.” Free Web Headers, 26 Mar. 2018, www.freewebheaders.com/full-list-of-bad-words-banned-by-google/. Accessed 28 July 2018.
Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 3rd ed., Prentice Hall PTR, Upper Saddle River, NJ, 2017, pp. 35-41.