Purpose:

This document presents the results of an exploratory data analysis of a training data set provided by SwiftKey. The data set will be used to build a word prediction algorithm, delivered as a Shiny application. The data set (i.e., the corpus) consists of three text files: en_US.blogs.txt, en_US.twitter.txt, and en_US.news.txt. The analysis made clear that two of the upcoming challenges will be the sheer size of the training data and building, from that data, an algorithm that performs acceptably inside the Shiny application.

Download the data:

dir.create('E:/Coursera', showWarnings = FALSE)
setwd('E:/Coursera')

if(!file.exists("E:/Coursera/Coursera-SwiftKey.zip")) {
        download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip','E:/Coursera/Coursera-SwiftKey.zip')
        unzip('E:/Coursera/Coursera-SwiftKey.zip',exdir='.')
}

Exploratory Data Analysis

Basic Data Statistics

stats.twitter <- as.numeric(shell("wc E:/Coursera/final/en_US/en_US.twitter.txt | gawk '{print $1; print $2; print $3}'", ignore.stderr=TRUE, intern=TRUE))
stats.news <- as.numeric(shell("wc E:/Coursera/final/en_US/en_US.news.txt | gawk '{print $1; print $2; print $3}'", ignore.stderr=TRUE, intern=TRUE))
stats.blogs <- as.numeric(shell("wc E:/Coursera/final/en_US/en_US.blogs.txt | gawk '{print $1; print $2; print $3}'", ignore.stderr=TRUE, intern=TRUE))
stats.df <- data.frame( blogs = stats.blogs, news = stats.news, twitter = stats.twitter, 
                        row.names = c("lines", "words", "characters"), stringsAsFactors = FALSE)
stats.df

##                blogs      news   twitter
## lines         899288   1010242   2360148
## words       37272578  34309642  30341028
## characters 210160014 205811889 167105338
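
The counts above rely on `wc` and `gawk` being available on the Windows PATH. As a rough, portable cross-check, the sketch below computes similar figures in base R. The helper name `text_stats` and the temporary demo file are made up for illustration and are not part of the original analysis. The numbers need not match `wc` exactly: by default `wc` reports bytes rather than characters in its third column, and it also counts the line terminators.

```r
# Hypothetical helper (not part of the original analysis): count lines,
# whitespace-separated words, and characters of a text file in base R.
text_stats <- function(path) {
  con <- file(path, "rb")   # binary mode, as the report uses for the news file
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
  # Split each line on runs of whitespace; empty lines contribute zero words.
  words <- lengths(strsplit(trimws(lines), "[[:space:]]+"))
  c(lines = length(lines),
    words = sum(words),
    characters = sum(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE))
}

# Tiny made-up file to show the shape of the result.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("How are you?", "Go Cubs Go!"), demo_path)
demo_stats <- text_stats(demo_path)
demo_stats
```

Running this on the full corpus files reads each file completely into memory, so it is slower than `wc` and is meant only as a sanity check.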

For now, only the first 30% of the lines in each file are read, to keep processing time manageable.

# Number of lines to read from each file (30% of the total, rounded down).
twitterNumLines30Pct <- floor(stats.twitter[1]*0.3)
newsNumLines30Pct <- floor(stats.news[1]*0.3)
blogsNumLines30Pct <- floor(stats.blogs[1]*0.3)

con <- file("E:/Coursera/final/en_US/en_US.twitter.txt", "r")
twitterData <- readLines(con, warn=FALSE, n=twitterNumLines30Pct)
close(con) 
con <- file("E:/Coursera/final/en_US/en_US.blogs.txt", "r")
blogsData <- readLines(con, warn=FALSE, n=blogsNumLines30Pct)
close(con) 
# Open the news file in binary mode ("rb"): some characters in this file
# cause a text-mode ("r") connection to stop reading early.
con <- file("E:/Coursera/final/en_US/en_US.news.txt", "rb")
newsData <- readLines(con, warn=FALSE, n=newsNumLines30Pct)
close(con)

Looking at the Data

head(twitterData)

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"

head(blogsData)

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"                                                                                    
## [6] "If you have an alternative argument, let's hear it! :)"

head(newsData)

## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"                                                                                                                                                                                                                                                            
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."

Data Processing and Transformations

The first processing steps were to convert all text to lower case, remove punctuation, remove numbers, and collapse multiple spaces into a single space. I then replaced every remaining character that is not a letter or an apostrophe with a space.

In addition, typos and spelling errors are prevalent. My intention is to fix these for the final project, if possible, but performance and other considerations may make this impractical.

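library(ngram)    # preprocess(), ngram(), get.phrasetable()
library(stringr)  # str_replace_all()

# Collapse each data set into a single string before cleaning it.
sTwitterData <- toString(twitterData)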
sTwitterDataProcessed <- preprocess(sTwitterData, case = "lower", remove.punct = TRUE, remove.numbers = TRUE, fix.spacing = TRUE)
sTwitterDataProcessed <- str_replace_all(sTwitterDataProcessed, "[^[:alpha:]\']", " ")

sBlogsData <- toString(blogsData)
sBlogsDataProcessed <- preprocess(sBlogsData, case = "lower", remove.punct = TRUE, remove.numbers = TRUE, fix.spacing = TRUE)
sBlogsDataProcessed <- str_replace_all(sBlogsDataProcessed, "[^[:alpha:]\']", " ")

sNewsData <- toString(newsData)
sNewsDataProcessed <- preprocess(sNewsData, case = "lower", remove.punct = TRUE, remove.numbers = TRUE, fix.spacing = TRUE)
sNewsDataProcessed <- str_replace_all(sNewsDataProcessed, "[^[:alpha:]\']", " ")
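
To make the effect of these steps concrete, here is a small base R sketch that applies roughly the same transformations to two of the tweets shown earlier. The function `clean_text` is a stand-in written for illustration. It is not the `preprocess()` implementation from the ngram package and may differ from it in edge cases, for example in how apostrophes are treated.

```r
# Illustrative stand-in for the cleaning pipeline (assumption: base R only).
clean_text <- function(x) {
  x <- tolower(x)                      # lower-case everything
  x <- gsub("[^[:alpha:]']", " ", x)   # keep only letters and apostrophes
  x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
  trimws(x)                            # drop leading/trailing spaces
}

cleaned <- clean_text(c(
  "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)",
  "they've decided its more fun if I don't."
))
cleaned
```

Emoticons, the ampersand, and the digit disappear, while contractions such as "they've" survive because the apostrophe is kept.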

Get Some Stats on Bigrams

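library(ggplot2)  # ggplot() and related plotting functions

# Build the bigram phrase table and keep its first 20 rows for plotting.
twitter2grams <- ngram(sTwitterDataProcessed, 2)
sTable <- get.phrasetable(twitter2grams)
sTableSmall <- sTable[1:20,]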

ggplot(aes(x=factor(sTableSmall$ngram), y=sTableSmall$freq), data=sTableSmall) + 
        geom_bar(stat = "identity") +
        labs(x="Bigram", y="Frequency") +
        ggtitle("'Twitter' Bigram Histogram (Top 20 for 30% of Data)") +
        theme(axis.text.x=element_text(angle=60,hjust=1,vjust=0.5)) +
        theme(plot.margin=unit(c(1,1,1,1),"cm"))

blogs2grams <- ngram(sBlogsDataProcessed, 2)
sBlogsTable <- get.phrasetable(blogs2grams)
sBlogsTableSmall <- sBlogsTable[1:20,]

ggplot(aes(x=factor(sBlogsTableSmall$ngram), y=sBlogsTableSmall$freq), data=sBlogsTableSmall) + 
        geom_bar(stat = "identity") +
        labs(x="Bigram", y="Frequency") +
        ggtitle("'Blogs' Bigram Histogram (Top 20 for 30% of Data)") +
        theme(axis.text.x=element_text(angle=60,hjust=1,vjust=0.5)) +
        theme(plot.margin=unit(c(1,1,1,1),"cm"))

news2grams <- ngram(sNewsDataProcessed, 2)
sNewsTable <- get.phrasetable(news2grams)
sNewsTableSmall <- sNewsTable[1:20,]

ggplot(aes(x=factor(sNewsTableSmall$ngram), y=sNewsTableSmall$freq), data=sNewsTableSmall) + 
        geom_bar(stat = "identity") +
        labs(x="Bigram", y="Frequency") +
        ggtitle("'News' Bigram Histogram (Top 20 for 30% of Data)") +
        theme(axis.text.x=element_text(angle=60,hjust=1,vjust=0.5)) +
        theme(plot.margin=unit(c(1,1,1,1),"cm"))
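
For readers without the ngram package, the counting behind these charts can be sketched in base R. The snippet below builds a small bigram frequency table from a made-up string. `count_bigrams` and its column names are illustrative choices and do not reproduce the exact layout returned by `get.phrasetable()`.

```r
# Illustrative bigram counter (assumption: the text is already cleaned and
# words are separated by spaces).
count_bigrams <- function(text) {
  words <- strsplit(trimws(text), " +")[[1]]
  if (length(words) < 2) {
    return(data.frame(bigram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  # Pair every word with the word that follows it.
  bigrams <- paste(head(words, -1), tail(words, -1))
  counts <- sort(table(bigrams), decreasing = TRUE)
  data.frame(bigram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

demo_bigrams <- count_bigrams("go cubs go go cubs go")
head(demo_bigrams, 20)   # most frequent bigrams first
```

On the real data the table would hold millions of rows, which is one reason the report works with only 30% of each file.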

Prospective Algorithm:

The steps identified to create the prediction algorithm are:

    1. Create a clean training data set. The cleaning steps I have identified so far are:
            - convert all alphabetic characters to lower case
            - expand contractions to their non-contracted forms (e.g. "aren't" to "are not")
            - eliminate all non-alphabetic characters
            - fix typos and spelling errors (if possible)
            
    2. Generate 2-gram and 3-gram probability matrices
    
    3. Possibly smooth the 2-gram and 3-gram probabilities to allow for n-grams not found in the data set
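
Steps 2 and 3 can be illustrated with a toy example. The sketch below estimates bigram probabilities from a made-up three-sentence corpus and applies add-one (Laplace) smoothing. That is only one of several smoothing options and is not necessarily the method the final application will use. All names (`bigram_model`, `predict_next`) are hypothetical.

```r
# Toy bigram model with add-one smoothing (illustration only).
bigram_model <- function(sentences) {
  tokens <- strsplit(tolower(sentences), " +")
  vocab <- sort(unique(unlist(tokens)))
  # counts[w1, w2] = number of times w2 directly follows w1
  counts <- matrix(0, length(vocab), length(vocab),
                   dimnames = list(vocab, vocab))
  for (w in tokens) {
    if (length(w) < 2) next
    for (i in seq_len(length(w) - 1)) {
      counts[w[i], w[i + 1]] <- counts[w[i], w[i + 1]] + 1
    }
  }
  # Add-one smoothing: P(w2 | w1) = (count(w1 w2) + 1) / (count(w1 *) + V)
  probs <- (counts + 1) / (rowSums(counts) + length(vocab))
  list(counts = counts, probs = probs)
}

# Return the most probable next word after `word`.
predict_next <- function(model, word) {
  row <- model$probs[word, ]
  names(row)[which.max(row)]
}

model <- bigram_model(c("i love you", "i love cubs", "you love cubs"))
predict_next(model, "i")      # "love"
model$probs["cubs", "i"]      # unseen bigram still gets a non-zero probability
```

A dense matrix like this grows with the square of the vocabulary size, so the full corpus would need a sparse or table-based representation instead.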

Steps that I have decided would not be appropriate for this project are:

    - Stemming. A word prediction algorithm should suggest words in their proper inflected forms, so I will not stem the text.

Resources

The following resources were researched and used in creating this document:

    - Natural Language Processing
    - Jurafsky and Martin NLP
    - GNU Utilities
    - Gawk for Windows
    - Ngram for R
    - quanteda
    - text2vec

Exploratory Data Analysis - SwiftKey Data

Jackie Goor

June 12, 2016