In this report, we obtain the ‘SwiftKey’ data set and load a subset of the English dataset for further processing. The data is cleaned by applying basic filters from the package ‘tm’ and removing profanity. Finally, we build data frames of individual words, pairs, and triples of words and explore their distribution and frequency. We also provide a brief outline of the modeling plan at the end of the document.
First, we start by downloading the SwiftKey dataset using the URL provided on the course website and unzipping the file.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists("Coursera-SwiftKey.zip")){
download.file(url, "Coursera-SwiftKey.zip", method = "curl")
}
if(!dir.exists("final/")){
unzip("Coursera-SwiftKey.zip")
}
Here we have a look at the English dataset, which will be used for the rest of the report.
dir("final/en_US/")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
The following chunk of code gives a basic summary of the three files in the English dataset: the number of lines, the number of non-empty lines, the number of characters, and the number of non-whitespace characters.
require("stringi")
rbind("Blogs" = stri_stats_general(readLines("final/en_US/en_US.blogs.txt")),
"News" = stri_stats_general(readLines("final/en_US/en_US.news.txt")),
"Twitter" = stri_stats_general(readLines("final/en_US/en_US.twitter.txt")))
## Lines LinesNEmpty Chars CharsNWhite
## Blogs 899288 899288 206824382 170389539
## News 1010242 1010242 203223154 169860866
## Twitter 2360148 2360148 162096031 134082634
Second, for further processing, we read the first 5000 lines of each file as a representative subset of the data and save these subsets in the directory sub/.
if(!dir.exists("sub/")){
dir.create("sub/")
write(readLines("final/en_US/en_US.blogs.txt", 5000), "sub/blogs.txt")
write(readLines("final/en_US/en_US.news.txt", 5000), "sub/news.txt")
write(readLines("final/en_US/en_US.twitter.txt", 5000), "sub/twitter.txt")
}
dir("sub/")
## [1] "blogs.txt" "news.txt" "twitter.txt"
The first step in cleaning the data is applying basic filters from the package ‘tm’: removing extra white space, punctuation, numbers, and stop words. In addition, the function transforms all letters to lower case, stems the words, and converts the documents to the ‘PlainTextDocument’ class. The function is applied after transforming the data into a ‘Corpus’ and returns a corpus which will be used for the further analysis.
cleantext <- function(doc){
require(tm)
doc <- tm_map(doc, stripWhitespace) # remove white spaces
doc <- tm_map(doc, removePunctuation) # remove punctuation
doc <- tm_map(doc, removeNumbers) # remove numbers
doc <- tm_map(doc, tolower) # turn words to lower case
doc <- tm_map(doc, removeWords, stopwords("english")) # remove stop words
doc <- tm_map(doc, stemDocument) # stemming
doc <- tm_map(doc, PlainTextDocument)
return(doc)
}
corpus <- cleantext( Corpus(DirSource("sub/")) )
Here, we obtain a list of profane words and apply a filter on the corpus to remove the words that match.
url <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
if(!file.exists("badwords.txt")){
download.file(url, "badwords.txt")
}
corpus <- tm_map(corpus, removeWords,
readLines("badwords.txt"))
The first step in EDA is transforming the corpus into a data frame. Then we use ‘NGramTokenizer’ from the package ‘RWeka’ to create collections of individual words, pairs, and triples of words. We tabulate the n-grams with the frequency at which they appear in the text, then order the data frames in descending order using the ‘dplyr’ package.
corpus.df <- data.frame(text=unlist(sapply(corpus,'[',"content")),stringsAsFactors=F)
TokenizersDelimiters <- "\"\'\\t\\r\\n ().,;!?"
arrangbyfreq <- function(x){
require("dplyr")
arrange(data.frame(x), desc(Freq))
}
require(RWeka)
unigram <- arrangbyfreq(table(UniTokenizer = NGramTokenizer(corpus.df, Weka_control(min = 1, max = 1))))
bigram <- arrangbyfreq(table(BiTokenizer = NGramTokenizer(corpus.df, Weka_control(min = 2, max = 2, delimiters = TokenizersDelimiters))))
trigram <- arrangbyfreq(table(TriTokenizer = NGramTokenizer(corpus.df, Weka_control(min = 3, max = 3, delimiters = TokenizersDelimiters))))
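As a quick sanity check, the short sketch below (using the data frames just built) reports how many distinct n-grams each collection contains.
c("Individual Words" = nrow(unigram),
"Pairs of Words" = nrow(bigram),
"Triples of Words" = nrow(trigram))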
This basic histogram shows the frequency of individual words as they appear in the text. Most words appear only a few times, as we can see from the skewed histogram.
hist(unigram$Freq,
breaks = 200,
lwd = 2,
main = "Distribution of Individual Words", xlab = "Indvidual Words",
col = "gray")
The following chunk of code shows a basic summary of the representative subsets: the minimum, quartiles, mean, and maximum of the number of times an individual word, a pair of words, or a triple of words appears in the text.
rbind("Individual Words" = summary(unigram$Freq),
"Pairs of Words" = summary(bigram$Freq),
"Triples of Words" = summary(trigram$Freq))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Individual Words 1 1 1 3.422 3 304
## Pairs of Words 1 1 1 1.049 1 26
## Triples of Words 1 1 1 1.002 1 5
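To quantify this skew, the sketch below runs a quick check on the ‘unigram’ data frame built above: the share of words that occur exactly once, and the number of distinct words needed to cover 50% and 90% of all word occurrences (the object names are illustrative).
singleton.share <- mean(unigram$Freq == 1) # share of words seen exactly once
coverage <- cumsum(unigram$Freq) / sum(unigram$Freq) # 'unigram' is already sorted by frequency
c("Singleton share" = singleton.share,
"Words for 50% coverage" = which(coverage >= 0.5)[1],
"Words for 90% coverage" = which(coverage >= 0.9)[1])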
The following graph shows the ten most frequent individual words, pairs, and triples of words in the text along with their frequencies.
plotfreq <- function(x,y,z){
barplot(x[1:10,2],
names.arg=x[1:10,1],
horiz = TRUE,
las = 2,
main = y,
col = z)
}
par(mfrow = c(1,3))
plotfreq(unigram, "Individual", "blue")
plotfreq(bigram, "Pairs", "red")
plotfreq(trigram, "Trips", "yellow")
My further plans involve building a probabilistic model on the n-gram sets of the data, i.e. calculating the probability of a word appearing after a single word or a pair of words. Based on these probabilities, the model can suggest the three most likely next words.
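As a first illustration of this plan, the sketch below uses the ‘bigram’ data frame built above to estimate, by maximum likelihood, the probability of each word following a given word and to return the most likely candidates. The function name and the example call are illustrative, and unseen words are not handled here; the final model will need smoothing or back-off for those. Note also that, since the corpus was stemmed and stop words were removed, both the input word and the suggestions are in that processed form.
suggestnext <- function(word, n = 3){
# split the stored pairs into first and second words
pairs <- strsplit(as.character(bigram[[1]]), " ")
first <- sapply(pairs, `[`, 1)
second <- sapply(pairs, `[`, 2)
idx <- which(first == word)
if(length(idx) == 0) return(character(0)) # unseen word: no suggestion in this sketch
# maximum-likelihood estimate of P(next word | word)
probs <- bigram$Freq[idx] / sum(bigram$Freq[idx])
head(second[idx][order(probs, decreasing = TRUE)], n)
}
suggestnext("time") # illustrative call: the three most likely words to follow "time"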