We are planning to create an application that predicts the next word of a text while the text is being typed. For example, if “I have read many” has already been typed, then the next word could be “books” or “stories”, but it is highly unlikely that the next word is “theirs”. The basic idea is to count how many times the triples “read many books” and “read many theirs” occur in some large body of real text and observe that the former is much more frequent than the latter (such triples of words are called trigrams).
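As a preview of how such counts could drive prediction, here is a minimal sketch. It assumes a trigram frequency table like the one built later in this report (a data frame trigrams with columns Word and Freq); the helper predictNext is introduced here purely for illustration and is not part of the original analysis.
# Illustrative sketch: most frequent completion of a two-word prefix
predictNext <- function(prefix, trigrams) {
  # keep only trigrams whose first two words match the typed prefix
  matches <- trigrams[grepl(paste0("^", prefix, " "), as.character(trigrams$Word)), ]
  if (nrow(matches) == 0) return(NA_character_)
  # return the third word of the most frequent matching trigram
  best <- as.character(matches$Word[which.max(matches$Freq)])
  tail(strsplit(best, " ")[[1]], 1)
}
# predictNext("read many", trigrams)   # expected to return "books" rather than "theirs"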
The large body of real text that we will use to create our models consists of three sets: about 900K blog entries, 1M news stories, and 2.3M twitter messages, each set containing more than 30M words. Together, there are more than 100M words in the three bodies of text. Here we download the data, load it into R, clean it, filter it for profanity, and perform some simple exploratory analysis. We identify the most common words, bigrams, and trigrams.
We found that less than 2% of all entries contain obscene language. Since it’s not much, we simply delete such entries to avoid predicting curse words. Besides, we download an “official” English dictionary and match the words found in the real data against the dictionary. According to our findings, about 14% of all the “real” words do not appear in the dictionary, e.g., “that’s” and “lol”. However, many words that appear in the data but not in the dictionary are foreign or misspelled words that we don’t want to use in prediction; they constitute about 2.5% of all the words.
We found that the top 20 most frequent words constitute about 28% of all words, the top 20 most frequent bigrams constitute about 3% of all bigrams, and the top 20 most frequent trigrams constitute less than 1% of all trigrams.
Here we load the necessary libraries and set the working directory.
library(ggplot2)   # plotting
library(stringi)   # string statistics
library(RWeka)     # word and n-gram tokenizers
setwd("/Users/orlenkoirina/Dropbox/Data Analysis with R/Capstone project 2")
Now we check if the directory “final/en_US” exists and contains 3 files. If it doesn’t, we download the data first.
if (nrow(file.info(dir("final/en_US/")))!=3) {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url,destfile="Coursera-SwiftKey.zip",method="curl")
  unzip("Coursera-SwiftKey.zip")
}
# Word counts for each file (stri_stats_latex reports a "Words" element)
statLBlogs <- stri_stats_latex(readLines("final/en_US/en_US.blogs.txt"))
statLNews <- stri_stats_latex(readLines("final/en_US/en_US.news.txt"))
statLTwitter <- stri_stats_latex(readLines("final/en_US/en_US.twitter.txt"))
# Line and character counts for each file (stri_stats_general reports a "Lines" element)
statBlogs <- stri_stats_general(readLines("final/en_US/en_US.blogs.txt"))
statNews <- stri_stats_general(readLines("final/en_US/en_US.news.txt"))
statTwitter <- stri_stats_general(readLines("final/en_US/en_US.twitter.txt"))
Here are some simple statistics for the 3 text files:
“final/en_US/en_US.blogs.txt” has 899288 entries and 37570839 words.
“final/en_US/en_US.news.txt” has 1010242 entries and 34494539 words.
“final/en_US/en_US.twitter.txt” has 2360148 entries and 30451128 words.
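For reference, the entry and word counts above are read off the named elements of these vectors; a small illustrative snippet (not part of the original chunk):
statBlogs["Lines"]      # 899288 entries in the blogs file
statLBlogs["Words"]     # 37570839 words in the blogs file
statTwitter["Lines"]    # 2360148 entries in the twitter file
statLTwitter["Words"]   # 30451128 words in the twitter file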
To avoid processing the big data sets at this preliminary stage, we create a small subset of the three data sets by including every entry with probability 0.01.
SampleProb <- 0.01   # probability of keeping each entry
sampleTexts <- c(readLines("final/en_US/en_US.blogs.txt")[
                   as.logical(rbinom(n=statBlogs["Lines"],size=1,prob=SampleProb))],
                 readLines("final/en_US/en_US.news.txt")[
                   as.logical(rbinom(n=statNews["Lines"],size=1,prob=SampleProb))],
                 readLines("final/en_US/en_US.twitter.txt")[
                   as.logical(rbinom(n=statTwitter["Lines"],size=1,prob=SampleProb))]
)
numSampleEntries <- length(sampleTexts)
Now we have 42758 entries in our sample.
To avoid predicting curse words, we simply delete entries that contain them. We are not trying to be comprehensive here, so we’ll just select three curse words to take care of.
cursePattern <- "([Ff]+[Uu]+[Cc]+[Kk]+)|([Ss]+[Hh]+[Ii]+[Tt]+)|([Cc]+[Uu]+[Nn]+[Tt]+)"
ind <- grepl(cursePattern,sampleTexts)
numProfaneEntries <- sum(ind)
We see that there are 487 entries with curse words among the total 42758 entries in the sample. Since only 1.14% of all entries contain obscene language, we simply delete them.
sampleTexts <- sampleTexts[!ind]
numSampleEntries <- length(sampleTexts)
Now we are left with 42324 entries.
The data set contains a lot of special characters and things that can hardly be predicted accurately, like numbers and URL’s. First, we delete all characters that are not punctuation, letters, or digits. Then we change:
all URL’s to “www”,
all emails to “email”,
all occurrences of the number 1 to “one”,
all other cardinal numerals (e.g., 78) to “many”,
all ordinal numerals (e.g., 78th) to “nth”.
Besides, cleaning numbers is a bit tricky because of commas — sometimes 2,360,148 is used instead of 2360148 for convenience. The following chunk of code is hidden because it’s ugly.
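The hidden code is not reproduced here, but a rough sketch of what such a cleaning step might look like follows; the exact patterns are illustrative assumptions, not the hidden code.
# Illustrative sketch of the cleaning step (not the original, hidden code)
sampleTexts <- gsub("[^[:alnum:][:punct:][:space:]]", " ", sampleTexts)        # drop unusual characters
sampleTexts <- gsub("(http|https|ftp)://[^[:space:]]+", "www", sampleTexts)    # URL's -> www
sampleTexts <- gsub("[[:alnum:]._%+-]+@[[:alnum:].-]+", "email", sampleTexts)  # emails -> email
sampleTexts <- gsub("(?<=[0-9]),(?=[0-9]{3})", "", sampleTexts, perl=TRUE)     # 2,360,148 -> 2360148
sampleTexts <- gsub("[0-9]+(st|nd|rd|th)\\b", "nth", sampleTexts)              # ordinal numerals -> nth
sampleTexts <- gsub("\\b1\\b", "one", sampleTexts)                             # the number 1 -> one
sampleTexts <- gsub("\\b[0-9]+\\b", "many", sampleTexts)                       # other cardinal numerals -> many
sampleTexts <- tolower(sampleTexts)                                            # everything to lower case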
We explore our sample (less than 1% of the whole data set). First, we count the frequency of every word and create a table of all words that actually appear in the sample.
# Delimiters used to split entries into word tokens
token_delim <- " \\t\\r\\n.!?,:;&\"-()[]_<>/…¡¿:、·。#@=/$"
tokenizedText <- WordTokenizer(sampleTexts,
                               Weka_control(delimiters = token_delim))
# Frequency table of every word that appears in the sample
words <- as.data.frame(table(tokenizedText))
names(words) <- c("Word","Freq")
Here is the histogram of the 20 most popular words.
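The plotting code itself is not shown in this report; a minimal sketch using ggplot2 (loaded above) could look as follows. The object top20 is introduced here only for illustration.
# Sketch: bar chart of the 20 most frequent words
top20 <- head(words[order(words$Freq,decreasing=TRUE),], n=20)
ggplot(top20, aes(x=reorder(Word,-Freq), y=Freq)) +
  geom_bar(stat="identity") +
  labs(x="Word", y="Frequency") +
  theme(axis.text.x = element_text(angle=45, hjust=1))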
Together, the 20 most popular words account for 287027 of all 1016489 words in the sample. They constitute 28.24% of all the words, and this percentage would probably be about the same if we used the full texts for our analysis.
A bigram is a pair of words that occur one after another, and a trigram is a triple of words that occur together. For example, the sentence “silicon elephants die hard on midnight” has the following trigrams: “silicon elephants die”, “elephants die hard”, “die hard on”, “hard on midnight”.
First, we identify bigrams.
# Delimiters used for n-gram tokenization
token_delim <- " .,;:\"?!#<>{}$<>\\/()…¡¿:、·。"
# min = max = 2 extracts bigrams
tokenizedText <- NGramTokenizer(sampleTexts,
                                Weka_control(min=2,max=2,delimiters = token_delim))
bigrams <- as.data.frame(table(tokenizedText))
names(bigrams) <- c("Word","Freq")
Here is the histogram of the most common bigrams.
There are 972423 different bigrams in the sample, and the 20 most popular ones account for 3.04% of all bigrams.
Now we identify trigrams and plot the most popular ones.
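The trigram chunk is not reproduced in the original; a sketch mirroring the bigram code above (same delimiters, with min and max set to 3) would be:
# Same tokenizer settings as for bigrams, but min = max = 3 extracts trigrams
tokenizedText <- NGramTokenizer(sampleTexts,
                                Weka_control(min=3,max=3,delimiters = token_delim))
trigrams <- as.data.frame(table(tokenizedText))
names(trigrams) <- c("Word","Freq")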
There are 930234 different trigrams in the sample, and the 20 most popular ones account for 0.37% of all trigrams.
Of course, not every word that we actually observe in the sample is a valid English word. For instance, there are rare names, foreign words, misspelled words, and combinations with an apostrophe. At the same time, not every valid English word appears in the sample. We will now download a dictionary of English words.
if (nrow(file.info(dir("EOWL-v1.1.2")))!=26) {
  url <- "http://dreamsteep.com/downloads/word-games-and-wordsmith-utilities/120-the-english-open-word-list-eowl/file.html"
  download.file(url,destfile="english_dict.zip",method="curl")
  unzip("english_dict.zip")
}
# The dictionary comes as 26 files (one per letter); read and concatenate them
allWords <- character(0)
dir.name <- "EOWL-v1.1.2/LF Delimited Format"
for (ind in 1:26) {
  currentFile <- paste(dir.name,dir(dir.name)[ind],sep="/")
  allWords <- c(allWords,readLines(currentFile))
}
allWords <- tolower(allWords)
# Words observed in the sample that are not found in the dictionary
nonDictWords <- words[words$Word %in% setdiff(words$Word,allWords),]
Now the variable allWords contains the dictionary, an “official” list of valid English words. Our sample contains 52773 different words, but among them there are 27971 that are not found in the dictionary. Here are 5 example entries and the 10 most frequent “un-official” words observed in the sample:
nonDictWords[c(1000,2000,4000,8000,16000),]
## Word Freq
## 1150 achey 1
## 3024 ariz 4
## 6799 bryson 1
## 14941 eddie 8
## 29222 mcdonalds 5
head(nonDictWords[order(nonDictWords$Freq,decreasing=TRUE),],n=10)
## Word Freq
## 833 a 24004
## 22837 i 16210
## 24407 it's 2428
## 22842 i'm 2149
## 14025 don't 1575
## 48690 u 1076
## 40247 rt 868
## 7481 can't 774
## 46784 that's 771
## 27680 lol 742
These un-official words constitute 14.24% of all the words in the sample. That is a big share, and we cannot simply ignore them. We assume that every word that appears at least 3 times is useful for prediction. Thus we will only discard words that are seen in the sample fewer than 3 times and are not found in the dictionary.
FreqThreshold <- 3   # words seen fewer than this many times and not in the dictionary are discarded
invalidWords <- nonDictWords[nonDictWords$Freq<FreqThreshold,]
Now we have identified 22706 “invalid” words that should be excluded from predictive models. Together, they constitute 2.61% of all the words in the sample.
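The shares quoted above can be obtained directly from the frequency tables; a small illustrative snippet (not part of the original report):
sum(nonDictWords$Freq)/sum(words$Freq)   # about 0.1424: share of "un-official" words
nrow(invalidWords)                       # 22706 "invalid" words
sum(invalidWords$Freq)/sum(words$Freq)   # about 0.0261: share of "invalid" words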
For the time being, we have ignored punctuation, although it could be used for prediction. Symbols like the comma or the full stop could be treated as words. Also, we did not distinguish between numerals that represent time, years, money, counts of objects, etc., although such information is important.
So far, we have converted everything to lower case. In the future, we will try to predict lower/upper case separately from predicting words.
Our analysis was originally created and run in RStudio v. 0.98.1080 under OS X 10.10.3.
Date and time the report was generated: 2015-07-26 21:34:00 (Central European Time).