Natural Language Project

The objective of the overall project is to create a predictive text model that reduces the number of required keystrokes and effectively predicts the next word typed, based on word frequency and context. This milestone report describes the major features of the training data through our exploratory data analysis and summarizes our plans for creating the prediction model. In this report, we endeavor to:

- Properly download and clean the text data.
- Create a basic report of summary statistics about the data sets.
- Perform exploratory data analysis.
- Report on some interesting findings.
- Give some feedback on a plan to create a prediction algorithm and Shiny app.

Data Source

Data for this project is from a corpus called HC Corpora.

The corpus provides three types of text data: blogs, news and twitter. For the purposes of this project, all sources will be assumed to be of equal quality, though there are some notable differences. For example, the twitter text data may contain more grammatical errors and misspellings. On the other hand, its focus on short, topical phrases may make twitter text ideal for predicting phrases of 2-4 words, the focus of this project.

All text data are provided in four languages: German, English (United States), Finnish and Russian. In this project, we focus only on the English (United States) data sets.

For this report we will load the quanteda package, which builds on several other R packages.

#### Load or install packages used
    library(ggplot2) # enhanced graphics
    library(ggthemes) # advanced themes
    library(quanteda) # corpus tokenizer and more
## quanteda version 0.99.22
## Using 3 of 4 threads for parallel computing
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
    library(stringi)

Some Summary Statistics

# Get the file sizes of the local data files
blogs_size <- file.info("en_US.blogs.txt")$size
news_size <- file.info("en_US.news.txt")$size
twitter_size <- file.info("en_US.twitter.txt")$size

# File size on disk (in bytes)
blogs_size
## [1] 210160014
news_size
## [1] 205811889
twitter_size
## [1] 167105338
#### Words in lines
# Count words per line by reading the raw files (the file size alone cannot give this)
blogs_words <- stri_count_words(readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE))
news_words <- stri_count_words(readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE))
twitter_words <- stri_count_words(readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE))
max(blogs_words)
max(news_words)
max(twitter_words)

Getting and Cleaning the Data

Since the twitter data contains emojis and symbols, it is important to remove non-ASCII characters and clean the data. Fortunately, the quanteda package provides the functionality needed to compute word frequencies without extensive manual regex construction.

Natural language processing techniques will be used to perform the analysis and build the predictive model. These allow for removing URLs, special characters, punctuation, numbers, excess whitespace and stopwords, as well as stemming words and changing the text to lower case.
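
As a rough illustration of the basic cleaning steps, here is a minimal base-R sketch; raw_text is a hypothetical example vector, and stopword removal and stemming are handled later through quanteda:

    # Hypothetical example vector standing in for lines of raw text
    raw_text <- c("Check out http://example.com NOW!!", "I ate 3 donuts   today...")
    clean_text <- tolower(raw_text)                                   # lower case
    clean_text <- gsub("http[[:alnum:][:punct:]]*", "", clean_text)   # remove URLs
    clean_text <- gsub("[0-9]+", "", clean_text)                      # remove numbers
    clean_text <- gsub("[[:punct:]]+", "", clean_text)                # remove punctuation
    clean_text <- gsub("\\s+", " ", trimws(clean_text))               # collapse excess whitespace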

# Download and unzip the data to local disk
    if (!file.exists("Coursera-SwiftKey.zip")) {
      download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                    destfile = "Coursera-SwiftKey.zip")
      unzip("Coursera-SwiftKey.zip")
    }
    # Define function to read a file and strip out lines with non-ASCII characters
    remove_nonasc <- function(file){
        print(file)
        # Read the data and force UTF-8 encoding
        text <- readLines(file, encoding = "UTF-8", skipNul = TRUE)
        # print the maximum number of characters in a line
        print(max(nchar(text)))

        # find indices of lines containing non-ASCII characters
        nonascIndex <- grep("text.tmp", iconv(text, "latin1", "ASCII", sub = "text.tmp"))
        # drop those lines, guarding against the case where none are found
        if (length(nonascIndex) > 0) text <- text[-nonascIndex]
        text
    }

Summary Stats

  # Load local files, deleting lines with non-ASCII characters
    blogsData <- remove_nonasc("en_US.blogs.txt")
## [1] "en_US.blogs.txt"
## [1] 40833
    newsData <- remove_nonasc("en_US.news.txt")
## [1] "en_US.news.txt"
## [1] 11384
    twitterData <- remove_nonasc("en_US.twitter.txt")
## [1] "en_US.twitter.txt"
## [1] 140
  # In-memory size of each cleaned data set (bytes)
  object.size(blogsData)
## 154428320 bytes
  object.size(newsData)
## 220678896 bytes
  object.size(twitterData)
## 304364720 bytes
  ## Max Words in lines
  max(stri_count_words(blogsData))
## [1] 1657
  max(stri_count_words(newsData))
## [1] 1796
  max(stri_count_words(twitterData))
## [1] 47
   # Number of lines in each cleaned data set
   length(blogsData)
## [1] 636261
   length(newsData)
## [1] 874278
   length(twitterData)
## [1] 2282717

To speed up processing, we extract a small percentage of the total records as a subset for exploratory purposes.

    # Sample the data using a percentage approach to test coverage
    set.seed(416)

    # Draw a 3% simple random sample (without replacement) from each source,
    # then cache the combined sample so later runs reuse the same subset
    if (!file.exists("data.sample.Rdata")) {
        data.sample <- c(sample(blogsData, length(blogsData) * 0.03, replace = FALSE),
            sample(newsData, length(newsData) * 0.03, replace = FALSE),
            sample(twitterData, length(twitterData) * 0.03, replace = FALSE))
        save("data.sample", file = "data.sample.Rdata")
    } else {
        load("data.sample.Rdata")
    }

Data Exploration

To continue the analysis, we create three term-document matrices for a) unigrams, b) bigrams and c) trigrams. These are commonly referred to as n-grams, contiguous sequences of n items from a given sequence of text or speech. The matrices will serve as the basis for word prediction in the algorithm to be built in the next phase of our capstone project.
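
As a quick illustration of what an n-gram is, quanteda can generate them directly from a tokens object; a small sketch on a made-up sentence:

    # Toy example: bigrams and trigrams from a single short sentence
    toy <- tokens("the quick brown fox jumps")
    tokens_ngrams(toy, n = 2)   # "the_quick" "quick_brown" "brown_fox" "fox_jumps"
    tokens_ngrams(toy, n = 3)   # "the_quick_brown" "quick_brown_fox" "brown_fox_jumps"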

A. Unigrams - With filters

For the purposes of this report, we filter out basic high-frequency words such as "the". We also use stemming to combine words that share a common root, and we remove the hashtag symbol so that twitter topics merge with ordinary words, which has the effect of increasing the twitter influence on word counts.
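
For example, stemming maps inflected forms onto a shared root; a quick check, assuming the char_wordstem() helper is available in this quanteda version:

    # Stemming collapses inflected forms onto a common root
    char_wordstem(c("run", "running", "runs"))  # all three should stem to "run"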

Here are the top fifteen words that appear most frequently in our data sample, with adjustments:

    ## Creates sparse data frame of unigrams
    mydf1 <- dfm(data.sample, ngrams=1, verbose = TRUE, toLower = TRUE,
             removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
             removeTwitter = TRUE, stem = TRUE, ignoredFeatures  = stopwords("english"),
             keptFeatures = NULL, language = "english", thesaurus = NULL,
             dictionary = NULL, valuetype = c("glob", "regex", "fixed"))
## Creating a dfm from a character input...
## Warning: Arguments toLower, removeNumbers, removePunct, removeSeparators,
## removeTwitter, ignoredFeatures, keptFeatures, language not used.
##    ... lowercasing
##    ... found 113,796 documents, 98,858 features
##    ... stemming features (English)
## , trimmed 23730 feature variants
##    ... created a 113,796 x 75,128 sparse dfm
##    ... complete. 
## Elapsed time: 15.8 seconds.
    # use quanteda to get a quick freq count
    top15unigrams <- topfeatures(mydf1, 15)  # 15 top words
    uni15_df <- data.frame(word=names(top15unigrams), freq=top15unigrams, row.names=NULL)
    rm(mydf1)
    # Define frequency plot function
    makePlot <- function(data, label) {
    ggplot(data[1:15,], aes(reorder(word, -freq), freq)) +
         labs(x = label, y = "Frequency") +
         theme_economist() +
         theme(axis.text.x = element_text(angle = 60, size = 11, hjust = 1)) +
         geom_bar(stat = "identity", fill = "blue") + coord_flip()
    }

    # show plot
    makePlot(uni15_df, "15 Most Common Unigrams")
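
As the warning above notes, several of the cleaning arguments passed to dfm() are ignored in this quanteda version. A hedged sketch of how the same filters could be applied explicitly through a tokens() pipeline (argument names assumed for this quanteda release; the resulting counts may differ slightly from the dfm above):

    # Apply the cleaning steps explicitly at the tokens stage, then build the dfm
    toks <- tokens(data.sample, remove_numbers = TRUE, remove_punct = TRUE,
                   remove_separators = TRUE, remove_twitter = TRUE)
    toks <- tokens_tolower(toks)                          # lower case
    toks <- tokens_remove(toks, stopwords("english"))     # drop English stopwords
    toks <- tokens_wordstem(toks, language = "english")   # stem to common roots
    mydf1_alt <- dfm(toks)
    topfeatures(mydf1_alt, 15)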

B. Bigrams - With filters

The two-word combinations called for more adjustments. In addition to switching to the faster "fastestword" tokenizer, which has the impact of decreasing accuracy, we set the concatenator so that the character between the words of a multi-word feature is a blank rather than an underscore, which would otherwise change the results returned.
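
The effect of the concatenator choice is easy to see on a toy example (a small sketch; tokens_ngrams() joins words with an underscore by default):

    # Underscore (default) versus blank-space concatenator for bigrams
    toy <- tokens("thanks for the follow")
    tokens_ngrams(toy, n = 2)                      # "thanks_for" "for_the" "the_follow"
    tokens_ngrams(toy, n = 2, concatenator = " ")  # "thanks for" "for the" "the follow"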

Here is a histogram showing the 15 most common bigrams in the data sample, with adjustments:

    # create sparse data frame of bigrams
      mydf2 <- dfm(data.sample, ngrams=2, concatenator = " ",
             what = "fastestword", 
             verbose = FALSE, toLower = TRUE,
             removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
             removeTwitter = FALSE,
             stem = FALSE, ignoredFeatures = stopwords("english"),
             keptFeatures = NULL, language = "english", thesaurus = NULL, 
             dictionary = NULL, valuetype = "fixed")  
## Warning: Arguments toLower, removeNumbers, removePunct, removeSeparators,
## removeTwitter, ignoredFeatures, keptFeatures, language not used.
    # use quanteda to get freq count
        top15bigrams <- topfeatures(mydf2, 15) 
        bi15_df <- data.frame(word=names(top15bigrams), freq=top15bigrams, row.names=NULL)
        
        rm(mydf2)
    
    # show plot
        makePlot(bi15_df, "15 Most Common Bigrams")

C. Trigrams - With filters

Using a similar configuration, here is a histogram of the 15 most common trigrams in the data sample:

    # Create sparse data frame of trigrams
        mydf3 <- dfm(data.sample, ngrams=3, concatenator = " ",
             what = "fastestword", 
             verbose = FALSE, toLower = TRUE,
             removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
             removeTwitter = FALSE,
             stem = FALSE, ignoredFeatures = stopwords("english"),
             keptFeatures = NULL, language = "english", thesaurus = NULL, 
             dictionary = NULL, valuetype = "fixed")
## Warning: Arguments toLower, removeNumbers, removePunct, removeSeparators,
## removeTwitter, ignoredFeatures, keptFeatures, language not used.
    # use quanteda to get freq count
        top15trigrams <- topfeatures(mydf3, 15) 
        tri15_df <- data.frame(word=names(top15trigrams), freq=top15trigrams, row.names=NULL)
        
        rm(mydf3)
    
    # show plot
        makePlot(tri15_df, "15 Most Common Trigrams")

Observations

Having a data management strategy is key to being able to build a model.

Initially we attempted to use the more traditional tm package. However, after running into memory issues on this data set, we switched to quanteda's clearly faster dfm() functionality.

The tokenization in quanteda is very conservative: by default it only removes separator characters unless additional options are specified. So there are still strings and word combinations that are candidates for further regex scrubbing.

On a positive note, for fast content analysis the quanteda package also allows us to look at similarities in the data and offers other features, such as building dictionaries of terms and meta-tagging content, to create a richer search experience.

Model development will include:

- Creating the prediction algorithm
- Increasing the sample size
- Optimizing the final corpus to achieve appropriate coverage and improve prediction accuracy (one way to gauge coverage is sketched below)
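
One way to gauge coverage is to ask how many of the most frequent unigrams account for, say, 90% of all word instances in the sample. A minimal sketch, assuming the unigram document-feature matrix (mydf1, removed above to free memory) is kept or rebuilt:

    # Cumulative share of word instances covered by the top-ranked unigrams
    freqs <- sort(colSums(mydf1), decreasing = TRUE)  # unigram counts, most frequent first
    coverage <- cumsum(freqs) / sum(freqs)
    min(which(coverage >= 0.9))  # unique words needed for roughly 90% coverage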

Then the Shiny app's server algorithm will receive the typed or pasted text and perform the following actions (a minimal sketch of the lookup chain follows this list):

Converting to lowercase

Removing non-ASCII characters, numbers, punctuation and white spaces

Filtering English stop words

Searching the last two words in the trigrams and retrieving matching patterns to send back to the UI

If not found, searching the last word in the bigrams and retrieving matching word patterns to send back to the UI

If not found, using smoothed probabilities to estimate the most likely words to follow

If not found, using backoff models to estimate the probability of unobserved n-grams
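
A minimal sketch of this lookup chain, assuming the bigram and trigram frequencies have been reshaped into data frames with hypothetical prefix, next_word and freq columns (smoothing and back-off weights are omitted here):

    # predict_next: hypothetical helper illustrating the trigram -> bigram back-off lookup
    predict_next <- function(input, trigram_df, bigram_df, default = "the") {
        words <- unlist(strsplit(tolower(input), "\\s+"))
        n <- length(words)
        if (n >= 2) {   # try the last two typed words against the trigram table
            hits <- trigram_df[trigram_df$prefix == paste(words[n - 1], words[n]), ]
            if (nrow(hits) > 0) return(hits$next_word[which.max(hits$freq)])
        }
        if (n >= 1) {   # back off to the last typed word against the bigram table
            hits <- bigram_df[bigram_df$prefix == words[n], ]
            if (nrow(hits) > 0) return(hits$next_word[which.max(hits$freq)])
        }
        default         # final fallback: a highly frequent unigram
    }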

The idea here is to keep this simple. The Shiny application is not geared toward long sentences or paragraphs, which would require another modeling approach.