Natural Language Processing Project

The objective of this project is to create a predictive text model that reduces the number of required keystrokes and effectively predicts the next word typed based on word frequency and context. Natural language processing techniques will be used to perform the analysis and build the predictive model.

This milestone report describes the major features of the training data with our exploratory data analysis and summarizes our plans for creating the predictive model. It meets the following benchmarks:

  1. Demonstrates the approach used to download and clean data.
  2. Creates a basic report of summary statistics about the data sets.
  3. Reports on some interesting findings.
  4. Outlines our plan for creating a prediction algorithm and Shiny app, on which we welcome feedback.

Data Source

Data for this project comes from a corpus called HC Corpora. See the readme file for details.

The corpus provides three types of sources: blogs, news and Twitter. For the purposes of this project, all sources will be treated as being of equal quality, though there are some notable differences. For example, the Twitter text may contain more grammar errors and misspellings. On the other hand, its focus on short topical phrases may make Twitter text ideal for predicting phrases of 2-4 words, the focus of this project.

All text data are provided in 4 different languages: 1) German, 2) English - United States, 3) Finnish and 4) Russian. In this project, we will only focus on the English - United States data sets.

For this report we load the quanteda package, which works together with several other R packages.

    # clear any prior values in environment
    rm(list = ls())
    # Load or install packages used
    library(ggplot2) # enhanced graphics
    library(ggthemes) # advanced themes
    library(quanteda) # corpus tokenizer and more

1. Get and Clean Data

Since the twitter data contains emojis and symbols, it is important to remove non-ASCII characters and clean the data. Fortunately, the quanteda package provides the functionality needed to compute word frequencies without extensive manual regex construction.

This allows for removing URLs, special characters, punctuation, numbers and excess whitespace, filtering stopwords, stemming words, and changing the text to lower case.
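
As an illustration only, these cleaning steps map onto quanteda's tokenizer options. The short sketch below assumes a newer quanteda release where the tokens_* helper functions are available; the older release used for this report passes the equivalent options directly to dfm(), as shown in Section 2.

    # Sketch only: cleaning a small character vector with quanteda's tokens_* helpers
    # (assumes a newer quanteda release; older versions take these as dfm() arguments)
    library(quanteda)
    txt <- c("Check http://example.com NOW!!", "He said it 99 times: don't stop...")
    toks <- tokens(txt,
                   remove_punct = TRUE,     # drop punctuation
                   remove_numbers = TRUE,   # drop digits
                   remove_url = TRUE,       # drop URLs
                   remove_symbols = TRUE)   # drop special characters
    toks <- tokens_tolower(toks)                      # change to lower case
    toks <- tokens_remove(toks, stopwords("english")) # filter stopwords
    toks <- tokens_wordstem(toks)                     # stem to common roots
    toks

Returning to the actual pipeline, we first download the raw data and strip out lines containing non-ASCII characters: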

    # Download and unzip the data to local disk
    if (!file.exists("Coursera-SwiftKey.zip")) {
      download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                    destfile = "Coursera-SwiftKey.zip")
      unzip("Coursera-SwiftKey.zip", exdir = "Coursera-SwiftKey")
    }
    # Define function to read a file and strip out lines with non-ASCII characters
    remove_nonasc <- function(file){
        print(file)
        # Read the data and force UTF-8 encoding
        text <- readLines(file, encoding = "UTF-8", skipNul = TRUE)
        # print length of the longest line (in characters)
        print(max(nchar(text)))
        # find indices of lines containing non-ASCII characters
        nonascIndex <- grep("text.tmp", iconv(text, "latin1", "ASCII", sub = "text.tmp"))
        # drop those lines from the original vector (guard against an empty index)
        if (length(nonascIndex) > 0) text <- text[-nonascIndex]
        text
    }

    # load local files
    blogsData   <- remove_nonasc("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
    newsData    <- remove_nonasc("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
    twitterData <- remove_nonasc("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")

    # get file size info (in MB)
    dir <- "./Coursera-SwiftKey/final/en_US"
    filelist <- list.files(dir)
    round(file.info(file.path(dir, filelist))$size / 1024^2, 1)

To speed up processing, we extract a small percentage of the records from each source as a subset for exploratory purposes.

    # Sample 3% of each source (without replacement) for exploratory analysis,
    # or reload a previously saved sample if one exists
    set.seed(416)
    if (!file.exists("data.sample.Rdata")) {
      data.sample <- c(sample(blogsData, length(blogsData) * 0.03, replace = FALSE),
                       sample(newsData, length(newsData) * 0.03, replace = FALSE),
                       sample(twitterData, length(twitterData) * 0.03, replace = FALSE))
      save("data.sample", file = "data.sample.Rdata")
      # Free the memory needed for processing later
      rm(blogsData)
      rm(newsData)
      rm(twitterData)
    } else {
      load("data.sample.Rdata")
    }

2. Data Exploration

To continue the analysis, we create three document-feature matrices for a) unigrams, b) bigrams and c) trigrams. These are commonly referred to as n-grams: a contiguous sequence of n items from a given sequence of text or speech. The matrices will serve as the basis for word prediction in the algorithm to be built in the next phase of our capstone project.
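
To make the idea concrete, here is a small base-R illustration (using a made-up sentence, not the corpus) of the unigrams, bigrams and trigrams produced by a single line of text:

    # Illustration only: n-grams of one example sentence using base R
    sentence <- "thanks for the follow"
    words    <- strsplit(sentence, " ")[[1]]
    unigrams <- words                                   # "thanks" "for" "the" "follow"
    bigrams  <- paste(head(words, -1), tail(words, -1)) # "thanks for" "for the" "the follow"
    trigrams <- paste(head(words, -2),
                      words[2:(length(words) - 1)],
                      tail(words, -2))                  # "thanks for the" "for the follow"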

A. Unigrams - With filters

For the purposes of this report, we filter out high-frequency stopwords such as “the”. We also use stemming to combine words with common root meanings, and we remove the “#” from hashtags so that hashtag topics merge with ordinary words, which has the effect of increasing the Twitter influence on the word counts.

Here are the top twenty words that appear most frequently in our sample:

    ## dfm() on a character vector creates a sparse document-feature matrix of unigrams
    mydf1 <- dfm(data.sample, verbose = TRUE, toLower = TRUE,
             removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
             removeTwitter = TRUE, stem = TRUE, ignoredFeatures = c("will", stopwords("english")),
             keptFeatures = NULL, language = "english", thesaurus = NULL,
             dictionary = NULL, valuetype = c("glob", "regex", "fixed"))
    # use quanteda to get a quick frequency count
    top20unigrams <- topfeatures(mydf1, 20)  # 20 top words
    uni20_df <- data.frame(word=names(top20unigrams), freq=top20unigrams, row.names=NULL)
    rm(mydf1)
    # Define frequency plot function
    makePlot <- function(data, label) {
      ggplot(data[1:20, ], aes(reorder(word, -freq), freq)) +
        labs(x = label, y = "Frequency") +
        theme_economist() +
        theme(axis.text.x = element_text(angle = 60, size = 11, hjust = 1)) +
        coord_flip() +
        geom_bar(stat = "identity", fill = I("grey50"))
    }
    # show plot
    makePlot(uni20_df, "20 Most Common Unigrams")

B. Bigrams - With filters

The two-word combinations called for more adjustments. In addition to switching to the faster "fastestword" tokenizer, which trades some accuracy for speed, the concatenator argument ensures that the character between the words of a multi-word feature is a blank rather than an underscore, which would otherwise change the results returned.

Here is a histogram of the 20 most common bigrams in the data sample, with these adjustments:

    # dfm() on a character vector creates a sparse document-feature matrix of bigrams
    mydf2 <- dfm(data.sample, ngrams=2, concatenator = " ",
             what = "fastestword", 
             verbose = FALSE, toLower = TRUE,
             removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
             removeTwitter = FALSE,
             stem = FALSE, ignoredFeatures = c("will", stopwords("english")),
             keptFeatures = NULL, language = "english", thesaurus = NULL, 
             dictionary = NULL, valuetype = "fixed")
    # use quanteda to get a quick frequency count
    top20bigrams <- topfeatures(mydf2, 20) 
    bi20_df <- data.frame(word=names(top20bigrams), freq=top20bigrams, row.names=NULL)
    rm(mydf2)
    # show plot
    makePlot(bi20_df, "20 Most Common Bigrams")

C. Trigrams - With filters

Using a similar configuration, here is a histogram of the 20 most common trigrams in the data sample:

    # dfm() on a character vector creates a sparse document-feature matrix of trigrams
    mydf3 <- dfm(data.sample, ngrams=3, concatenator = " ",
             what = "fastestword",
             verbose = FALSE, toLower = TRUE,
             removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
             removeTwitter = FALSE, stem = FALSE, ignoredFeatures = c("will", stopwords("english")),
             keptFeatures = NULL, language = "english", thesaurus = NULL, 
             dictionary = NULL, valuetype = "fixed")
    # use quanteda to get a quick frequency count
    top20trigrams <- topfeatures(mydf3, 20) 
    tri20_df <- data.frame(word=names(top20trigrams), freq=top20trigrams, row.names=NULL)
    rm(mydf3)
    makePlot(tri20_df, "20 Most Common Trigrams")

3. Interesting Observations That Could Make a Difference

Having a data management strategy is key to being able to build a model.

Initially we attempted to use the more traditional tm package. However, after running into memory issues with this data set, we switched to quanteda's dfm(), which is clearly faster.

The tokenization in quanteda is very conservative: by default it only removes separator characters unless additional options are specified. So there are still strings and word combinations that are candidates for more regex scrubbing.
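
For example, a few extra gsub() passes could handle some of these cases before tokenization. The patterns below are illustrative only, not the exact rules used in this report:

    # Sketch of additional regex scrubbing; patterns are illustrative only
    scrub <- function(x) {
      x <- gsub("(f|ht)tps?://\\S+", " ", x)  # leftover URLs
      x <- gsub("\\S+@\\S+\\.\\S+", " ", x)   # e-mail addresses
      x <- gsub("[^a-zA-Z' ]", " ", x)        # anything that is not a letter, apostrophe or space
      x <- gsub("\\s+", " ", x)               # collapse excess whitespace
      trimws(x)
    }
    scrub("RT @user: c'mon... see http://t.co/xyz or write to me@example.com!!")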

On a positive note, for fast content analysis the quanteda package also allows us to look at similarities between texts and to use other features such as building dictionaries of terms and meta-tagging content to create a richer search experience.
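
As a small illustration of the dictionary idea (the categories and terms below are made up for this example and are not part of the prediction model), a hand-built dictionary can be applied to a document-feature matrix to tally topic mentions:

    # Illustration only: a tiny hand-built dictionary applied to a dfm
    # (dfm_lookup() assumes a newer quanteda; older releases pass dictionary = ... to dfm())
    topic_dict <- dictionary(list(weather = c("rain", "snow", "sunny"),
                                  sports  = c("inning", "pitcher", "goal")))
    small_dfm  <- dfm(tokens(c("rain delayed the ninth inning",
                               "a sunny day and a late goal")))
    dfm_lookup(small_dfm, topic_dict)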

4. Next Steps - Model Development and Shiny App

Model development will include:

  • Creating the prediction algorithm
  • Increasing the sample size
  • Optimizing the final corpus to achieve appropriate coverage and improve prediction accuracy (a rough coverage check is sketched below)
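
As a rough sketch of the coverage check (the counts below are made up; in practice the full unigram frequency table from the sampled corpus would be used), the number of unique words needed to cover a given share of all word instances can be read off a cumulative sum:

    # Sketch only: unique words needed to cover X% of all word instances,
    # given a named vector of unigram counts (made-up numbers here)
    word_freq <- sort(c(said = 120, just = 95, like = 90, time = 60, peopl = 30,
                        thing = 20, world = 10, rare = 5), decreasing = TRUE)
    coverage  <- cumsum(word_freq) / sum(word_freq)
    words_for_coverage <- function(p) which(coverage >= p)[1]
    words_for_coverage(0.5)   # words needed to cover 50% of instances
    words_for_coverage(0.9)   # words needed to cover 90% of instances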

Then the Shiny app server algorithm will receive the typed or pasted text and perform the following actions:

  • Converting to lowercase
  • Removing non-ASCII characters, numbers, punctuation and white spaces
  • Filtering English stop words
  • Searching the last 2 words in the trigram table and retrieving matching patterns to send back to the UI
  • If not found, searching the last word in the bigram table and retrieving matching word patterns to send back to the UI
  • If not found, using smoothed probabilities to estimate the most likely words to follow
  • If not found, using backoff models to estimate the probability of unobserved n-grams (a rough sketch of this lookup-and-backoff chain follows below)

The idea here is to keep this simple. The Shiny application is not geared toward long sentences or paragraphs, which would require another modeling approach.
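
A minimal sketch of this lookup-with-backoff idea is shown below. It assumes two frequency tables, tri_df and bi_df, built from the full trigram and bigram matrices in the same way as tri20_df and bi20_df above (word holds the n-gram as a space-separated string, freq its count); the helper name and the toy tables are hypothetical, and the real app would add the smoothing and backoff weighting described above.

    # Sketch only: trigram lookup with a bigram fallback
    # tri_df / bi_df are assumed frequency tables shaped like tri20_df / bi20_df
    predict_next <- function(phrase, tri_df, bi_df, n = 3) {
      words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
      # 1. try the last two words against the trigram table
      if (length(words) >= 2) {
        prefix <- paste(tail(words, 2), collapse = " ")
        hits <- tri_df[grepl(paste0("^", prefix, " "), tri_df$word), ]
        if (nrow(hits) > 0) {
          hits <- hits[order(-hits$freq), ]
          return(sapply(strsplit(head(hits$word, n), " "), tail, 1))
        }
      }
      # 2. back off: try the last word against the bigram table
      prefix <- tail(words, 1)
      hits <- bi_df[grepl(paste0("^", prefix, " "), bi_df$word), ]
      if (nrow(hits) > 0) {
        hits <- hits[order(-hits$freq), ]
        return(sapply(strsplit(head(hits$word, n), " "), tail, 1))
      }
      # 3. otherwise fall back to smoothed / most-frequent-word estimates
      character(0)
    }
    # hypothetical usage with toy tables
    tri_df <- data.frame(word = c("thanks for the", "one of the", "thanks for all"),
                         freq = c(30, 25, 10), stringsAsFactors = FALSE)
    bi_df  <- data.frame(word = c("for the", "of the", "for all"),
                         freq = c(80, 70, 20), stringsAsFactors = FALSE)
    predict_next("Thanks for", tri_df, bi_df)   # returns "the" "all"

In an app like this, the frequency tables would typically be precomputed once from the corpus, so each keystroke only triggers cheap lookups.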
