We are planning to create an application that predicts the next word of a text while the text is being typed. For example, if “I have read many” has already been typed, then the next word could be “books” or “stories”, but it is highly unlikely that the next word is “theirs”. The basic idea is to count how many times the triples “read many books” and “read many theirs” occur in some large body of real text and observe that the former is much more frequent than the latter (such triples of words are called trigrams).
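As a preview of how such counts could drive prediction, here is a minimal sketch. It assumes a trigram frequency table like the one built later in this report (a data frame trigrams with columns Word and Freq); the helper predictNext is introduced here purely for illustration and is not part of the original analysis.
# Illustrative sketch: most frequent completion of a two-word prefix
predictNext <- function(prefix, trigrams) {
  # keep only trigrams whose first two words match the typed prefix
  matches <- trigrams[grepl(paste0("^", prefix, " "), as.character(trigrams$Word)), ]
  if (nrow(matches) == 0) return(NA_character_)
  # return the third word of the most frequent matching trigram
  best <- as.character(matches$Word[which.max(matches$Freq)])
  tail(strsplit(best, " ")[[1]], 1)
}
# predictNext("read many", trigrams)   # expected to return "books" rather than "theirs"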
The large body of real text that we will use to create our models consists of three sets: about 900K blog entries, 1M news stories, and 2.3M twitter messages, each set containing more than 30M words. Together, there are more than 100M words in the three bodies of text. Here we download the data, load it into R, clean it, filter it for profanity, and perform some simple exploratory analysis. We identify the most common words, bigrams, and trigrams.
We found that less than 2% of all entries contain obscene language. Since it’s not much, we simply delete such entries to avoid predicting curse words. Besides, we download an “official” English dictionary and match the words found in the real data against the dictionary. According to our findings, about 14% of all the “real” words do not appear in the dictionary, e.g., “that’s” and “lol”. However, many words that appear in the data but not in the dictionary are foreign or misspelled words that we don’t want to use in prediction; they constitute about 2.5% of all the words.
We found that the top 20 most frequent words constitute about 28% of all words, the top 20 most frequent bigrams constitute about 3% of all bigrams, and the top 20 most frequent trigrams constitute less than 1% of all trigrams.
Here we load the necessary libraries and set the working directory.
library(ggplot2)   # plotting
library(stringi)   # string statistics
library(RWeka)     # word and n-gram tokenizers
setwd("/Users/orlenkoirina/Dropbox/Data Analysis with R/Capstone project 2")
Now we check if the directory “final/en_US” exists and contains 3 files. If it doesn’t, we download the data first.
if (nrow(file.info(dir("final/en_US/")))!=3) {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url,destfile="Coursera-SwiftKey.zip",method="curl")
  unzip("Coursera-SwiftKey.zip")
}
# Word counts for each file (stri_stats_latex reports a "Words" element)
statLBlogs <- stri_stats_latex(readLines("final/en_US/en_US.blogs.txt"))
statLNews <- stri_stats_latex(readLines("final/en_US/en_US.news.txt"))
statLTwitter <- stri_stats_latex(readLines("final/en_US/en_US.twitter.txt"))
# Line and character counts for each file (stri_stats_general reports a "Lines" element)
statBlogs <- stri_stats_general(readLines("final/en_US/en_US.blogs.txt"))
statNews <- stri_stats_general(readLines("final/en_US/en_US.news.txt"))
statTwitter <- stri_stats_general(readLines("final/en_US/en_US.twitter.txt"))
Here are some simple statistics for the 3 text files:
“final/en_US/en_US.blogs.txt” has 899288 entries and 37570839 words.
“final/en_US/en_US.news.txt” has 1010242 entries and 34494539 words.
“final/en_US/en_US.twitter.txt” has 2360148 entries and 30451128 words.
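For reference, the entry and word counts above are read off the named elements of these vectors; a small illustrative snippet (not part of the original chunk):
statBlogs["Lines"]      # 899288 entries in the blogs file
statLBlogs["Words"]     # 37570839 words in the blogs file
statTwitter["Lines"]    # 2360148 entries in the twitter file
statLTwitter["Words"]   # 30451128 words in the twitter file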
To avoid processing the big data sets at this preliminary stage, we create a small subset of the three data sets by including every entry with probability 0.01.
SampleProb <- 0.01   # probability of keeping each entry
sampleTexts <- c(readLines("final/en_US/en_US.blogs.txt")[
                   as.logical(rbinom(n=statBlogs["Lines"],size=1,prob=SampleProb))],
                 readLines("final/en_US/en_US.news.txt")[
                   as.logical(rbinom(n=statNews["Lines"],size=1,prob=SampleProb))],
                 readLines("final/en_US/en_US.twitter.txt")[
                   as.logical(rbinom(n=statTwitter["Lines"],size=1,prob=SampleProb))]
)
numSampleEntries <- length(sampleTexts)
Now we have 42758 entries in our sample.
To avoid predicting curse words, we simply delete entries that contain them. We are not trying to be comprehensive here, so we’ll just select three curse words to take care of.
cursePattern <- "([Ff]+[Uu]+[Cc]+[Kk]+)|([Ss]+[Hh]+[Ii]+[Tt]+)|([Cc]+[Uu]+[Nn]+[Tt]+)"
ind <- grepl(cursePattern,sampleTexts)
numProfaneEntries <- sum(ind)
We see that there are 487 entries with curse words among the total 42758 entries in the sample. Since only 1.14% of all entries contain obscene language, we simply delete them.
sampleTexts <- sampleTexts[!ind]
numSampleEntries <- length(sampleTexts)
Now we are left with 42324 entries.
The data set contains a lot of special characters and things that can hardly be predicted accurately, like numbers and URL’s. First, we delete all characters that are not punctuation, letters, or digits. Then we change:
all URL’s to “www”,
all emails to “email”,
all occurrences of the number 1 to “one”,
all other cardinal numerals (e.g., 78) to “many”,
all ordinal numerals (e.g., 78th) to “nth”.
Besides, cleaning numbers is a bit tricky because of commas — sometimes 2,360,148 is used instead of 2360148 for convenience. The following chunk of code is hidden because it’s ugly.
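The hidden code is not reproduced here, but a rough sketch of what such a cleaning step might look like follows; the exact patterns are illustrative assumptions, not the hidden code.
# Illustrative sketch of the cleaning step (not the original, hidden code)
sampleTexts <- gsub("[^[:alnum:][:punct:][:space:]]", " ", sampleTexts)        # drop unusual characters
sampleTexts <- gsub("(http|https|ftp)://[^[:space:]]+", "www", sampleTexts)    # URL's -> www
sampleTexts <- gsub("[[:alnum:]._%+-]+@[[:alnum:].-]+", "email", sampleTexts)  # emails -> email
sampleTexts <- gsub("(?<=[0-9]),(?=[0-9]{3})", "", sampleTexts, perl=TRUE)     # 2,360,148 -> 2360148
sampleTexts <- gsub("[0-9]+(st|nd|rd|th)\\b", "nth", sampleTexts)              # ordinal numerals -> nth
sampleTexts <- gsub("\\b1\\b", "one", sampleTexts)                             # the number 1 -> one
sampleTexts <- gsub("\\b[0-9]+\\b", "many", sampleTexts)                       # other cardinal numerals -> many
sampleTexts <- tolower(sampleTexts)                                            # everything to lower case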
We explore our sample (less than 1% of the whole data set). First, we count the frequency of every word and create a table of all words that actually appear in the sample.
# Delimiters used to split entries into word tokens
token_delim <- " \\t\\r\\n.!?,:;&\"-()[]_<>/…¡¿:、·。#@=/$"
tokenizedText <- WordTokenizer(sampleTexts,
                               Weka_control(delimiters = token_delim))
# Frequency table of every word that appears in the sample
words <- as.data.frame(table(tokenizedText))
names(words) <- c("Word","Freq")
Here is the histogram of the 20 most popular words.
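The plotting code itself is not shown in this report; a minimal sketch using ggplot2 (loaded above) could look as follows. The object top20 is introduced here only for illustration.
# Sketch: bar chart of the 20 most frequent words
top20 <- head(words[order(words$Freq,decreasing=TRUE),], n=20)
ggplot(top20, aes(x=reorder(Word,-Freq), y=Freq)) +
  geom_bar(stat="identity") +
  labs(x="Word", y="Frequency") +
  theme(axis.text.x = element_text(angle=45, hjust=1))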
Together, the 20 most popular words account for 287027 of all 1016489 words in the sample. They constitute 28.24% of all the words, and this percentage would probably be about the same if we used the full texts for our analysis.
A bigram is a pair of words that occur one after another, and a trigram is a triple of words that occur together. For example, the sentence “silicon elephants die hard on midnight” has the following trigrams: “silicon elephants die”, “elephants die hard”, “die hard on”, “hard on midnight”.
First, we identify bigrams.
# Delimiters used for n-gram tokenization
token_delim <- " .,;:\"?!#<>{}$<>\\/()…¡¿:、·。"
# min = max = 2 extracts bigrams
tokenizedText <- NGramTokenizer(sampleTexts,
                                Weka_control(min=2,max=2,delimiters = token_delim))
bigrams <- as.data.frame(table(tokenizedText))
names(bigrams) <- c("Word","Freq")
Here is the histogram of the most common bigrams.
There are 972423 different bigrams in the sample, and the 20 most popular ones account for 3.04% of all bigrams.
Now we identify trigrams and plot the most popular ones.
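The trigram chunk is not reproduced in the original; a sketch mirroring the bigram code above (same delimiters, with min and max set to 3) would be:
# Same tokenizer settings as for bigrams, but min = max = 3 extracts trigrams
tokenizedText <- NGramTokenizer(sampleTexts,
                                Weka_control(min=3,max=3,delimiters = token_delim))
trigrams <- as.data.frame(table(tokenizedText))
names(trigrams) <- c("Word","Freq")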
There are 930234 different trigrams in the sample, and the 20 most popular ones account for 0.37% of all trigrams.
Of course, not every word that we actually observe in the sample is a valid English word. For instance, there are rare names, foreign words, misspelled words, and combinations with an apostrophe. At the same time, not every valid English word appears in the sample. We will now download a dictionary of English words.
if (nrow(file.info(dir("EOWL-v1.1.2")))!=26) {
  url <- "http://dreamsteep.com/downloads/word-games-and-wordsmith-utilities/120-the-english-open-word-list-eowl/file.html"
  download.file(url,destfile="english_dict.zip",method="curl")
  unzip("english_dict.zip")
}
# The dictionary comes as 26 files (one per letter); read and concatenate them
allWords <- character(0)
dir.name <- "EOWL-v1.1.2/LF Delimited Format"
for (ind in 1:26) {
  currentFile <- paste(dir.name,dir(dir.name)[ind],sep="/")
  allWords <- c(allWords,readLines(currentFile))
}
allWords <- tolower(allWords)
# Words observed in the sample that are not found in the dictionary
nonDictWords <- words[words$Word %in% setdiff(words$Word,allWords),]
Now the variable allWords contains the dictionary, an “official” list of valid English words. Our sample contains 52773 different words, but among them there are 27971 that are not found in the dictionary. Here are 5 example entries and the 10 most frequent “un-official” words observed in the sample:
nonDictWords[c(1000,2000,4000,8000,16000),]
## Word Freq
## 1150 achey 1
## 3024 ariz 4
## 6799 bryson 1
## 14941 eddie 8
## 29222 mcdonalds 5
head(nonDictWords[order(nonDictWords$Freq,decreasing=TRUE),],n=10)
## Word Freq
## 833 a 24004
## 22837 i 16210
## 24407 it's 2428
## 22842 i'm 2149
## 14025 don't 1575
## 48690 u 1076
## 40247 rt 868
## 7481 can't 774
## 46784 that's 771
## 27680 lol 742
These un-official words constitute 14.24% of all the words in the sample. That is a big share, and we cannot simply ignore them. We assume that every word that appears at least 3 times is useful for prediction. Thus we will only discard words that are seen in the sample fewer than 3 times and are not found in the dictionary.
FreqThreshold <- 3   # words seen fewer than this many times and not in the dictionary are discarded
invalidWords <- nonDictWords[nonDictWords$Freq<FreqThreshold,]
Now we have identified 22706 “invalid” words that should be excluded from predictive models. Together, they constitute 2.61% of all the words in the sample.
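The shares quoted above can be obtained directly from the frequency tables; a small illustrative snippet (not part of the original report):
sum(nonDictWords$Freq)/sum(words$Freq)   # about 0.1424: share of "un-official" words
nrow(invalidWords)                       # 22706 "invalid" words
sum(invalidWords$Freq)/sum(words$Freq)   # about 0.0261: share of "invalid" words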
For the time being, we have ignored punctuation, although it could be used for prediction. Symbols like the comma or the full stop could be treated as words. Also, we did not distinguish between numerals that represent time, years, money, counts of objects, etc., although such information is important.
So far, we have converted everything to lower case. In the future, we will try to predict lower/upper case separately from predicting words.
Our analysis was originally created and run in RStudio v. 0.98.1080 under OS X 10.10.3.
Date and time the report was generated: 2015-07-26 21:34:00 (Central European Time).