Summary

This report, submitted in partial fulfillment of the requirements of the Coursera course "Data Science Capstone", concerns the R package "wordprediction", which contains functions for an auto-completion model that predicts the next word to be typed from a previously typed word or phrase. The source code can be found at https://github.com/Barkia/Final-Project-Submission, a presentation at http://rpubs.com/Barkhia/wordprediction, and a demonstration at https://ahmedbarkhia.shinyapps.io/word-prediction/.

Auto-completion is a common function on mobile devices. As a user types, an auto-completion function presents possible completions of the word currently being typed, or probable words that could follow the word or phrase just typed. The package "wordprediction" provides the latter function.

Exploratory Analysis

Data File

In order to build a function that can provide word prediction, a predictive model is needed. Such models use known content to predict unknown content. For this package, that content comes from the HC Corpora collection, which is "a collection of corpora for various languages freely available to download" (Christensen, n.d.). The version used was obtained from an archive maintained at Coursera (Leek, Peng, Caffo, & Johns Hopkins University, n.d.). The archive included three text document collections (blogs, news feeds, and tweets) in four languages (German, English, Finnish, and Russian), of which only the English collections were used. The files were too large to be manipulated on a home computer (e.g., the downloaded ZIP file was 575 MB). Therefore, 1,000 lines were randomly sampled from each collection using a Mac OS X (x86_64-apple-darwin13.4.0) terminal application before being loaded for analysis in RStudio (Version 0.99.892) running the R statistical programming language (Version 3.2.3).
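Although the sampling itself was done in a terminal, an equivalent approach can be sketched in R. The file names below are the usual names in the Coursera archive, and the seed is arbitrary; both are assumptions rather than the report's actual commands.

set.seed(1)  # arbitrary seed, for reproducibility only
sample_lines <- function(path, n = 1000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}
blogs  <- sample_lines("en_US.blogs.txt")
news   <- sample_lines("en_US.news.txt")
tweets <- sample_lines("en_US.twitter.txt")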

Data Structure

This package uses the data structures described in Feinerer, Hornik, and Meyer (2008) from the Text Mining Package "tm" (Version 0.6-2; Feinerer, Hornik, & Artifex Software, Inc., 2015). In these structures, text document collections are organized into corpora, the basic objects to be manipulated by the word-prediction function. Accordingly, the HC Corpora data files were loaded as tm corpora.
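As a sketch of this step, the sampled character vectors from above can be wrapped as a single tm corpus; the object names are assumptions carried over from the earlier sketch.

library(tm)

# Each sampled line becomes one document in the corpus.
corpus <- VCorpus(VectorSource(c(blogs, news, tweets)))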

Data Cleaning

Data cleaning involves transforming the raw text in the corpus into a format more suitable for automated manipulation. The tm package provides numerous functions for such transformations (see Feinerer et al., 2008, p. 9). For this package, the texts were converted to lower case, whitespace was stripped, and common stopwords (i.e., words so common that they carry little information; see Feinerer et al., 2008, pp. 25-26) were removed.
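These transformations can be sketched with standard tm functions, again assuming the corpus object from above:

corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, stripWhitespace)                    # collapse runs of whitespace
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove common stopwords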

Data Features

From the cleaned English corpus, a term-document matrix (TDM) was created: a matrix whose rows are terms (words or phrases), whose columns are documents, and whose entries are the frequencies of the terms in the documents. For this corpus, the range of word frequencies was 1 to 256. The most important words were defined as those appearing neither too infrequently nor too frequently. Following the reasoning of Feinerer et al. (2008):

we take values around 30, since smaller values for this corpus tend to be already negligible due to the large number of documents. On the other side bigger values tend to be too common in most of the newsgroup postings.  (p. 39)

These important words are as follows:

##  [1] "across"  "decided" "found"   "fun"     "guys"    "hard"    "lol"    
##  [8] "please"  "program" "thank"   "trying"
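Such a term list can be extracted with tm's findFreqTerms(); the exact frequency bounds below are an assumption based on the reasoning quoted above.

tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq = 25, highfreq = 35)  # terms with frequencies near 30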

A table was then created that shows the number of unique words and the summed frequency of those words for each of the three text collections, as follows:

##              Blogs News Feeds Tweets
## Unique Words  8030       7642   3271
## Sum of Words 22589      19059   6992
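A table of this kind can be sketched from one term-document matrix per collection; the three TDM object names here are assumptions, not names from the package.

library(slam)  # row_sums() for tm's sparse matrices

collection_stats <- function(tdm) {
  c("Unique Words" = nrow(tdm), "Sum of Words" = sum(row_sums(tdm)))
}
sapply(list("Blogs" = tdm_blogs, "News Feeds" = tdm_news,
            "Tweets" = tdm_tweets), collection_stats)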

Finally, in order to check whether the sample size of 1,000 lines per collection was sufficient, the term-document matrix was checked to see whether it obeyed two laws from linguistics concerning large corpora. First, "Zipf's law … states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table" (Explore Corpus Term Frequency Characteristics, n.d.).

Figure 1. Zipf’s law holds for this matrix, especially for higher ranks, as shown by the adherence of the curve obtained from the data to the theoretical diagonal line.

Second, “Heaps’ law … states that the vocabulary size V (i.e., the number of different terms employed) grows polynomially with the text size T (the total number of terms in the texts)” (Explore Corpus Term Frequency Characteristics, n.d.).

Figure 2. Heaps’ law holds extremely well for this matrix, as shown by the near-complete adherence of the line obtained from the data to the theoretical diagonal line.
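Both diagnostics are available directly in tm, as documented at the page cited above; applied to the TDM from the sketches above:

Zipf_plot(tdm)   # Figure 1: log frequency against log rank
Heaps_plot(tdm)  # Figure 2: log vocabulary size against log text size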

Data Processing

According to Wikipedia (N-gram, n.d.), “an n-gram is a contiguous sequence of n items from a given sequence of text or speech.” This package takes a key word or phrase, finds the most frequent term in a TDM of n-word terms whose first n-1 words match the key, and returns the nth word of that term.

Of course, not all possible words or phrases exist in the corpus from which the TDM was derived. For this reason, a simplified Katz’s back-off model is used, which backs off to smaller n-grams when the key is not found in the larger n-gram table. The maximum n-gram handled is a trigram, and the word returned is the match found in the largest n-gram table in which the key appears. When no match is found at any level, the most common word in the corpus, “will”, is returned. This function is demonstrated using a Shiny app hosted on shinyapps.io at https://ahmedbarkhia.shinyapps.io/word-prediction/.
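A minimal sketch of such a back-off predictor is given below. It assumes the n-gram frequencies have been stored as named numeric vectors (e.g., c("one of the" = 12, ...)); the actual data structures and code in the package may differ.

katz_backoff_model <- function(key, trigrams, bigrams, unigrams) {
  last_word <- function(s) tail(strsplit(s, " ")[[1]], 1)
  # Return the last word of the most frequent n-gram starting with `prefix`.
  predict_from <- function(ngrams, prefix) {
    hits <- ngrams[startsWith(names(ngrams), paste0(prefix, " "))]
    if (length(hits) == 0) return(NULL)
    last_word(names(hits)[which.max(hits)])
  }
  words <- strsplit(tolower(key), " ")[[1]]
  if (length(words) >= 2) {                       # try the trigram table first
    hit <- predict_from(trigrams, paste(tail(words, 2), collapse = " "))
    if (!is.null(hit)) return(hit)
  }
  hit <- predict_from(bigrams, tail(words, 1))    # back off to bigrams
  if (!is.null(hit)) return(hit)
  names(unigrams)[which.max(unigrams)]            # fall back to the top unigram
}

For example, katz_backoff_model("one of", trigrams, bigrams, unigrams) would return the most frequent word observed after "one of" in the corpus, backing off to the most frequent word after "of", and finally to "will".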

Conclusion

This report has described features of the R package “wordprediction”. The package was designed using samples of 1,000 lines each from a corpus of English text collections. The corpus has a large range of words and word frequencies, with a number of important words. The sample size was large enough to satisfy two laws from linguistics concerning large corpora. If the desired sample size is deemed too small, it can easily be increased by editing the source code. As is, this analysis ran rather quickly on a home computer and on shinyapps.io. As shown in the demonstration, all phrases and words submitted to the function “katz_backoff_model” result in a prediction in the form of a single returned word.

References

Christensen, H. (n.d.). HC Corpora. Retrieved from http://www.corpora.heliohost.org

Explore Corpus Term Frequency Characteristics [Computer software]. (n.d.). Retrieved from http://www.inside-r.org/packages/cran/tm/docs/Zipf_n_Heaps

Feinerer, I., Hornik, K., & Artifex Software, Inc. (2015, July 2). Text Mining Package [Computer software]. Retrieved from http://tm.r-forge.r-project.org

Feinerer, I., Hornik, K., & Meyer, D. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 1–54. https://doi.org/10.18637/jss.v025.i05

Leek, J., Peng, R., Caffo, B., & Johns Hopkins University. (n.d.). Data Science Capstone [MOOC]. Coursera. Retrieved from https://www.coursera.org/learn/data-science-project/