
adrián álvarez del castillo
April 30, 2016

This is an R Markdown document.


Exploratory analysis

Summary

The purpose of this study is to begin the exploratory analysis of the provided data. The first goal is to identify some of the main features of the three kinds of files (blog, news & twitter). General tools such as histograms and statistical summaries are provided, along with some field-specific tools such as tag cloud diagrams. The second goal is to provide a brief reference for the next steps in the development of the predictive algorithm and, eventually, a related Shiny app.

Some definitions are included to provide a general understanding of specific concepts; in those cases, the sources are referenced.

Setup environment

Before executing any of the tasks, all the required libraries are loaded.
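
The chunk below is a minimal sketch of that setup; the exact package set is an assumption based on the tools referenced later in this report (tm, RWeka, wordcloud, stringi).

library(tm)            # text mining: corpus handling and cleaning
library(RWeka)         # n-gram tokenization
library(wordcloud)     # tag cloud plots
library(RColorBrewer)  # color palettes for the clouds
library(stringi)       # fast string statistics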

Load data

The data was downloaded directly from the course webpage, as referenced in Task 0, from the following URL: (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)
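
A sketch of how the files could be obtained and listed follows; the extraction folder final/en_US is an assumption based on the usual layout of the zip file.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip")
    unzip("Coursera-SwiftKey.zip")
}
list.files("final/en_US")   # the English files used in this analysis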

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

General statistics

Some basic statistics, such as the file sizes and the number of lines and characters in the documents, are gathered. This is an important step because it gives an initial sense of the data and sets expectations for the later analysis.
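
The sketch below shows one way to gather these figures; stri_stats_general() from the stringi package reports the Lines, LinesNEmpty, Chars and CharsNWhite counts shown in the output, and the file sizes are expressed in megabytes. The file paths are assumptions.

blog <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twit <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# file sizes in megabytes
data.frame(size.blog = file.size("final/en_US/en_US.blogs.txt") / 2^20,
           size.news = file.size("final/en_US/en_US.news.txt") / 2^20,
           size.twit = file.size("final/en_US/en_US.twitter.txt") / 2^20)

# line and character counts for each document
sapply(list(count.blog = blog, count.news = news, count.twit = twit),
       stri_stats_general)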

##   size.blog size.news size.twit
## 1  200.4242  196.2775  159.3641
##             count.blog count.news count.twit
## Lines           899288    1010242    2360148
## LinesNEmpty     899288    1010242    2360148
## Chars        206824382  203223154  162096031
## CharsNWhite  170389539  169860866  134082634

Words per line

The number of words per line is an important feature of the documents. This variable can change dramatically from one type to another, reflecting the constraints and style of each kind of source. A histogram is a good tool for representing this variable; however, outliers can be a problem since they might spread the range very widely. For each case, the top percentile will be removed from the data before building the histogram.
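
A sketch of the procedure applied to each file below: count the words per line, print the summary, and plot a histogram after dropping the top percentile. The use of stri_count_words() is an assumption; any word-counting routine would serve.

words_per_line <- function(lines, title) {
    wpl <- stri_count_words(lines)               # words in each line
    print(summary(wpl))
    trimmed <- wpl[wpl <= quantile(wpl, 0.99)]   # drop the top percentile
    hist(trimmed, breaks = 50, main = title, xlab = "Words per line")
    invisible(wpl)
}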

Blog data

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00

News data

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00

Twitter data

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00

Data sampling

Since the documents are very large, a random sample of records will be selected from each file.

These samples are written to disk for further processing.
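
A sketch of this step; the 1% sampling fraction, the seed and the output file names are assumptions for illustration.

set.seed(1234)
sample_frac <- 0.01   # assumed sampling fraction

blog_sample <- sample(blog, round(length(blog) * sample_frac))
news_sample <- sample(news, round(length(news) * sample_frac))
twit_sample <- sample(twit, round(length(twit) * sample_frac))

# write the samples to disk for further processing
dir.create("sample", showWarnings = FALSE)
writeLines(blog_sample, "sample/en_US.blogs_sample.txt")
writeLines(news_sample, "sample/en_US.news_sample.txt")
writeLines(twit_sample, "sample/en_US.twitter_sample.txt")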

Create a corpus

Once the data from the three files was sampled, a corpus was generated by merging the samples.

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. (https://en.wikipedia.org/wiki/Text_corpus)
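
A minimal sketch of the merge, assuming the sample objects created above:

merged <- c(blog_sample, news_sample, twit_sample)
corpus <- VCorpus(VectorSource(merged))   # one document per sampled line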

Data cleansing

To clean the corpus, the tm_map function from the tm package was applied with several transformations (see the sketch after this list):

  • Change the case
  • Remove punctuation
  • Remove numbers
  • Strip white spaces
  • Remove words (stop and even profane)
  • Stem the corpus (reduce words that have a common root)

A list of profane words was obtained from a research group at CMU. (http://www.cs.cmu.edu/~biglou/resources/)
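
The sketch below applies the listed transformations with tm_map; the local profanity file name is an assumption (the list itself comes from the CMU resource above), and stemDocument() requires the SnowballC package.

profanity <- readLines("profanity.txt", warn = FALSE)       # assumed local copy of the CMU list

corpus <- tm_map(corpus, content_transformer(tolower))       # change the case
corpus <- tm_map(corpus, removePunctuation)                  # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                      # remove numbers
corpus <- tm_map(corpus, stripWhitespace)                    # strip white spaces
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove stop words
corpus <- tm_map(corpus, removeWords, profanity)             # remove profane words
corpus <- tm_map(corpus, stemDocument)                       # stem the corpus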

n-gram Tokenization

To determine word frequencies, n-gram models are utilized. Using the RWeka library, unigrams, bigrams and trigrams are created.
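
A sketch of the RWeka tokenizers and the resulting term-document matrices; the object names are illustrative.

unigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# one term-document matrix per n-gram size
tdm1 <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_tok))
tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tok))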

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles. An n-gram of size 1 is referred to as a unigram; size 2 is a bigram; size 3 is a trigram. Larger sizes are sometimes referred to by the value of n, e.g., four-gram, five-gram, and so on.

An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of a (n − 1)–order Markov model. n-gram models are now widely used in probability, communication theory, computational linguistics, computational biology, and data compression. Two benefits of n-gram models and algorithms that use them, are simplicity and scalability – with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently. (https://en.wikipedia.org/wiki/N-gram)

Most frequent n-grams

To represent the most frequent n-grams, the tag cloud representation was chosen.

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword metadata (tags) on websites, or to visualize free form text. Tags are usually single words, and the importance of each tag is shown with font size or color. This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence. (https://en.wikipedia.org/wiki/Tag_cloud)
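
A sketch of one such cloud for the unigrams (bigrams and trigrams are analogous, using tdm2 and tdm3); the plotting parameters are illustrative.

# sum term frequencies across documents and plot the most frequent terms
freq1 <- sort(rowSums(as.matrix(tdm1)), decreasing = TRUE)
wordcloud(names(freq1), freq1, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))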

Uni-grams

Bi-grams

Tri-grams

Future development

The n-grams are a great starting point for providing a priori knowledge of typing patterns. Combining them with mixture models could be a powerful tool for word prediction.

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with “mixture distributions” relate to deriving the properties of the overall population from those of the sub-populations, “mixture models” are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. (https://en.wikipedia.org/wiki/Mixture_model)
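
As a very rough sketch of that idea, the score of a candidate next word could be a weighted mixture of unigram, bigram and trigram relative frequencies (linear interpolation). The weights and the count-table format below are assumptions for illustration, not the final model.

# score a candidate next word given the two previous words, using
# named count vectors whose names are space-separated n-grams
score_candidate <- function(w1, w2, candidate, uni, bi, tri,
                            lambdas = c(0.1, 0.3, 0.6)) {
    p_uni <- uni[candidate] / sum(uni)
    p_bi  <- bi[paste(w2, candidate)] / uni[w2]
    p_tri <- tri[paste(w1, w2, candidate)] / bi[paste(w1, w2)]
    probs <- c(p_uni, p_bi, p_tri)
    probs[is.na(probs)] <- 0          # unseen n-grams contribute nothing
    sum(lambdas * probs)
}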


The following are some general steps to consider:

  • Build the predictive algorithm from the n-gram frequencies, possibly combined with mixture models as described above.
  • Develop the related Shiny app that applies the algorithm to suggest the next word as the user types.


(aac)