milestoneproj.Rmd
Prepared by Marcel Merchat
September 4, 2016
The purpose of this milestone report is to demonstrate progress on key project tasks such as the following:
Download the US English data files and import them into the program.
Provide summary statistics about the data sets.
Outline a plan for a prediction algorithm and Shiny Application.
The data is unzipped into a folder called Data.
setwd("~/edu/Data Science/capstone/Project")
## fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## download.file(fileUrl, dest="zipData.zip")
## unzip("zipData.zip", exdir=".")
There are folders for four languages. The English folder contains three files of Twitter chat, news, and blog text as follows:
en_US.twitter.txt en_US.news.txt en_US.blogs.txt
The Twitter data consists of over 2 million lines of chat. There are also about 77,000 lines of news text and about 900,000 lines of blog text, some of which are the length of journal articles.
setwd("~/edu/Data Science/capstone/Project")
ustwitter <- readLines("./Data/en_US/en_US.twitter.txt")
length(ustwitter) #[1] 2360148
## [1] 2360148
usnews <- readLines("./Data/en_US/en_US.news.txt")
length(usnews) #[1] 77259
## [1] 77259
usblogs <- readLines("./Data/en_US/en_US.blogs.txt")
length(usblogs) #[1] 899288
## [1] 899288
To estimate the number of words in each United States file, we construct training data sets by sampling lines from the files. We considered the AppliedPredictiveModeling package, but the us_twitter data file required too much computer memory. Selecting random lines with a TRUE or FALSE vector generated by the R rbinom function worked well. Our analysis and exploration are based on a very small sampling of the data, but the samples still include approximately one million words of text from each of the three files.
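A minimal sketch of this sampling step is shown below. The variable names prob and samples mirror those used in the scaling formulas further down, but the seed and the probability value are assumed examples rather than the settings used for the actual training sets.
set.seed(1234)                           # assumed seed, for reproducibility only
prob <- 0.01                             # assumed fraction of lines to keep
samples <- length(ustwitter)             # number of candidate Twitter lines
keep <- as.logical(rbinom(samples, size = 1, prob = prob))
twitraining <- ustwitter[keep]           # random sample of Twitter lines
## blogtraining and newstraining are drawn the same way from usblogs and usnews,
## likely with different prob values so each sample holds roughly one million words.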
For the United States, we estimate about 30 million words in the Twitter file, about 37 million in the blog file, and about 2.6 million in the news file.
## us_twitter
get_word_count(twitraining)/(prob*samples/twitter_word_count1)
## [1] 29652186
## us_blogs
get_word_count(blogtraining)/(prob*samples/blog_word_count1)
## [1] 37238042
## us_news
get_word_count(newstraining)/(prob*samples/news_word_count1)
## [1] 2602514
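The helper get_word_count is not shown in this report; presumably it counts the words in the sampled training lines, and the denominator rescales that count to estimate the full file (prob, samples, and the *_word_count1 variables come from the full analysis script). A minimal sketch of such a word counter, assuming whitespace-separated tokens, is:
get_word_count <- function(lines) {
    ## Count whitespace-separated tokens across all lines of a character vector.
    sum(lengths(strsplit(lines, "\\s+")))
}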
The histogram below shows that most words are rare and appear only a few times, while a small number of words appear many times. The x-axis represents the popularity of words (occurrences per million words), and the y-axis indicates how many words have a given popularity.
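The 1-gram tables combined below are built with a helper that is not shown in this report. A hedged sketch of one possible implementation follows; the helper name mirrors the 2-gram and 3-gram calls used later, the minimum_word_count value is an assumed example, and only the Freq and Source columns are actually required by the plotting code.
get_1_gram_dictionary <- function(lines, minimum_word_count, source_name) {
    words  <- unlist(strsplit(lines, "\\s+"))         # split each line into tokens
    words  <- words[nzchar(words)]                    # drop empty tokens
    counts <- sort(table(words), decreasing = TRUE)   # count each word
    counts <- counts[counts >= minimum_word_count]    # keep reasonably common words
    data.frame(word = names(counts), Freq = as.integer(counts),
               Source = source_name, stringsAsFactors = FALSE)
}
twitter_1_grams <- get_1_gram_dictionary(twitraining, minimum_word_count=10, "Twitter")
blog_1_grams <- get_1_gram_dictionary(blogtraining, minimum_word_count=10, "US_Blogs")
news_1_grams <- get_1_gram_dictionary(newstraining, minimum_word_count=10, "US_News")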
dfsources <- rbind(twitter_1_grams, blog_1_grams, news_1_grams)
#dfsources <- twitter_1_grams
ggplot(dfsources, aes(x=Freq)) +
    ggtitle("1-Gram Word Count vs Popularity") +
    stat_bin(boundary=20, breaks=seq(20, 200, by=10)) +
    coord_cartesian(xlim = c(20, 200)) +
    scale_x_continuous(name="Popularity per Million Words (Occurrence Rate)",
                       limits=c(20, 200)) +
    scale_y_continuous(name="Number of Words") +
    facet_grid(. ~ Source)
The histograms below show that, compared with single words, fewer 2-gram phrases reach high popularity.
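The get_2_gram_dictionary helper called below is likewise not defined in this report. A minimal sketch of one possible implementation, analogous to the 1-gram sketch above, pairs each word with the one that follows it and counts the pairs; for simplicity it ignores line boundaries.
get_2_gram_dictionary <- function(lines, minimum_word_count, source_name) {
    words  <- unlist(strsplit(lines, "\\s+"))
    words  <- words[nzchar(words)]
    pairs  <- paste(head(words, -1), tail(words, -1))  # adjacent word pairs
    counts <- sort(table(pairs), decreasing = TRUE)
    counts <- counts[counts >= minimum_word_count]
    data.frame(any3 = names(counts), Freq = as.integer(counts),
               Source = source_name, stringsAsFactors = FALSE)
}
The column name any3 matches the head() output shown below for the actual tables.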
twitter_2_grams <- get_2_gram_dictionary(twitraining,
                                         minimum_word_count=10, "Twitter")
blog_2_grams <- get_2_gram_dictionary(blogtraining,
                                      minimum_word_count=10, "US_Blogs")
news_2_grams <- get_2_gram_dictionary(newstraining,
                                      minimum_word_count=10, "US_News")
dfsources2 <- rbind(twitter_2_grams, blog_2_grams, news_2_grams)
ggplot(dfsources2, aes(x=Freq)) +
    ggtitle("2-Gram Phrase Count vs Popularity") +
    stat_bin(boundary=2.5, breaks=seq(5, 1000, by=5)) +
    coord_cartesian(xlim = c(0, 100)) +
    scale_x_continuous(name="Popularity of Phrases per Million Words (Occurrence Rate)",
                       limits=c(2.5, 1000)) +
    scale_y_continuous(name="Number of Phrases") +
    facet_grid(. ~ Source)
head(twitter_2_grams)
## any3 Freq Source
## 1 in the 1292 Twitter
## 2 Thanks for 893 Twitter
## 3 of the 877 Twitter
## 4 for the 806 Twitter
## 5 I love 791 Twitter
## 6 to be 773 Twitter
head(blog_2_grams)
## any3 Freq Source
## 1 of the 2405 US_Blogs
## 2 in the 1835 US_Blogs
## 3 to the 1087 US_Blogs
## 4 on the 974 US_Blogs
## 5 and the 933 US_Blogs
## 6 to be 914 US_Blogs
nums2 <- as.numeric(as.character(twitter_2_grams$Freq))
## Here is a statistics summary of the 2-gram phrase counts.
summary(nums2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   10.00   12.00   17.00   32.74   30.00 1292.00
twitter_3_grams <- get_3_gram_dictionary(twitraining,
                                         minimum_word_count=3, "Twitter")
blog_3_grams <- get_3_gram_dictionary(blogtraining,
                                      minimum_word_count=3, "US_Blogs")
news_3_grams <- get_3_gram_dictionary(newstraining,
                                      minimum_word_count=3, "US_News")
dfsources <- rbind(twitter_3_grams, blog_3_grams, news_3_grams)
ggplot(dfsources, aes(x=Freq)) +
    ggtitle("3-Gram Phrase Count vs Popularity") +
    stat_bin(boundary=2.5, breaks=seq(2.5, 200.5, by=1)) +
    coord_cartesian(xlim = c(0, 23)) +
    scale_x_continuous(name="Popularity of Phrases per Million Words (Occurrence Rate)",
                       limits=c(0, 300)) +
    scale_y_continuous(name="Number of Phrases") +
    facet_grid(. ~ Source)
head(twitter_3_grams)
## any3 Freq Source
## 1 Thanks for the 469 Twitter
## 2 thanks for the 204 Twitter
## 3 Thank you for 183 Twitter
## 4 Looking forward to 175 Twitter
## 5 I want to 161 Twitter
## 6 I love you 131 Twitter
head(blog_3_grams)
## any3 Freq Source
## 1 as well as 111 US_Blogs
## 2 a lot of 106 US_Blogs
## 3 one of the 102 US_Blogs
## 4 I want to 94 US_Blogs
## 5 I have a 90 US_Blogs
## 6 I have been 89 US_Blogs
nums3 <- as.numeric(as.character(twitter_3_grams$Freq))
## Here is a statistics summary of the 3-gram phrase counts.
summary(nums3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.000   3.000   4.000   6.093   6.000 469.000
Consider the string “Nothing in the world.” The last three words are “in the world.” Since in our problem the final word is unknown, we start with the string “Nothing in the” and use its last two words, “in the,” to predict the unknown final word.
str <- "Nothing in the"
patternlast <- "\\s([a-zA-Z']{1,50})[\\.]?$"
re <- regexec(patternlast,str)
lastword <- regmatches(str, re) [[1]][2]
lastword
## [1] "the"
The last two words form a 2-gram object, but here we only consider what might be predicted from the 3-gram model given that we know the first two words are “in the.”
patternnexttolast <- "\\s([a-zA-Z']{1,50})\\s[a-zA-Z']{1,50}[\\.]?$"
re <- regexec(patternnexttolast,str)
nexttolast <- regmatches(str, re) [[1]][2]
nexttolast
## [1] "in"
Our plan is an n-gram model that predicts the next word based on the previous one, two, or three words; we will also attempt to handle unseen n-grams. We form a data frame that is suitable as input for a prediction model in the caret package. The first word of the 2-gram model might somehow be added to the same data frame, but here we simply see that there are a few phrases in our 3-gram data from which a correct response can be chosen. Perhaps the most likely choice would be the 3-gram with the highest frequency of occurrence; a sketch of such a lookup follows the matching output below.
col1 <- unlist(lapply(as.character(twitter_3_grams[,1]),
                      function(x) {strsplit(x, "\\s+")[[1]][1]}))
col2 <- unlist(lapply(as.character(twitter_3_grams[,1]),
                      function(x) {strsplit(x, "\\s+")[[1]][2]}))
col3 <- unlist(lapply(as.character(twitter_3_grams[,1]),
                      function(x) {strsplit(x, "\\s+")[[1]][3]}))
data <- data.frame(col1, col2, col3, stringsAsFactors = FALSE)
## Given 2-gram or two word combination: Find possible matching 3-gram from
## the Twitter data set.
paste(nexttolast,lastword)
## [1] "in the"
## Find possible matching 3-gram from the Twitter data set.
data[data[,1]==nexttolast & data[,2] == lastword,]
## col1 col2 col3
## 51 in the world
## 125 in the morning
## 306 in the middle
## 337 in the face
## 831 in the next
## 1246 in the air
## 1247 in the city
## 1248 in the last
## 1620 in the first
## 1621 in the game
## 1622 in the house
## 1623 in the kitchen
## 1624 in the past
## 1625 in the studio
## 1626 in the works
## 2163 in the back
## 2164 in the car
## 2165 in the dark
## 2166 in the day
## 2167 in the fall
## 2168 in the history
## 2169 in the office
## 2170 in the same
## 2171 in the snow
## 2172 in the USA
## 2173 in the way
## 3195 in the afternoon
## 3196 in the area
## 3197 in the fridge
## 3198 in the hospital
## 3200 in the mirror
## 3201 in the mood
## 3202 in the NBA
## 3203 in the neighborhood
## 3204 in the present
## 3205 in the rain
## 3206 in the sky
## 3207 in the store
## 3208 in the street
## 5124 in the ass
## 5125 in the background
## 5126 in the bedroom
## 5127 in the club
## 5128 in the corner
## 5129 in the final
## 5130 in the future
## 5131 in the lab
## 5132 in the mail
## 5133 in the military
## 5134 in the movie
## 5135 in the name
## 5136 in the second
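As suggested above, a simple predictor can return the final word of the matching 3-gram with the highest frequency. The sketch below illustrates that lookup; the helper name predict_next_word and its arguments are illustrations rather than part of the report's code.
predict_next_word <- function(first_word, second_word, grams3 = twitter_3_grams) {
    ## Split each stored 3-gram into its three words.
    parts <- strsplit(as.character(grams3$any3), "\\s+")
    w1 <- vapply(parts, `[`, character(1), 1)
    w2 <- vapply(parts, `[`, character(1), 2)
    w3 <- vapply(parts, `[`, character(1), 3)
    freq <- as.numeric(as.character(grams3$Freq))
    ## Keep only the 3-grams whose first two words match the given 2-gram.
    hits <- which(w1 == first_word & w2 == second_word)
    if (length(hits) == 0) return(NA_character_)  # unseen 2-gram: back off to a shorter model
    ## Return the final word of the most frequent matching 3-gram.
    w3[hits[which.max(freq[hits])]]
}
predict_next_word(nexttolast, lastword)
## This should return "world" if, as the ordering above suggests, "in the world"
## is the most frequent "in the ..." 3-gram in the Twitter sample.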