Project Aims

This project aims to use text data collected from Twitter, blogs and news sites to produce a predictive text model. So far I have downloaded and extracted the data, sampled and cleaned the data, and performed some exploratory analysis. I have written R scripts for each of these processes, which are available in my GitHub repository: https://github.com/Smeths/CapStoneProject

Downloading and Extracting the Data

The following script downloads the data from “https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip” if it is not already present in the working directory. The script also extracts the data and downloads a “bad words” list created by Google, which will be used to filter out profanity during the cleaning process.
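For reference, the download-and-extract step looks roughly like the sketch below. This is an illustration rather than the contents of 01_download_extract.R, and it assumes the archive unpacks into a “final” directory (the usual layout of the Coursera-SwiftKey zip).

data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

# only download if the archive is not already in the working directory
if (!file.exists(zip_file)) {
    download.file(data_url, destfile = zip_file, mode = "wb")
}

# unzip once; assumes the archive unpacks into a "final" directory
if (!dir.exists("final")) {
    unzip(zip_file)
}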

source("01_download_extract.R")
knitr::kable(df_data_files,caption="Data Files")
Data Files

Blog File          News File         Twitter File
----------------   ----------------  ------------------
en_US.blogs.txt    en_US.news.txt    en_US.twitter.txt

Cleaning and Sampling the Data

The next script contains a function that produces random samples of a specified size from each of the data files, using a specified random seed. The samples are cleaned by removing profanity (using the file discussed above), and the function returns a data frame containing basic information, such as the number of lines, for each data file.
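As a rough illustration of the sampling idea (a sketch only, not the actual gen_sample() code), something along these lines could be used, assuming nl is the number of lines to keep and bad_words is the profanity list downloaded earlier (both assumptions):

sample_lines <- function(path, nl, seed, bad_words) {
    set.seed(seed)                                   # reproducible sampling
    lines <- readLines(path, skipNul = TRUE, encoding = "UTF-8")
    keep  <- lines[sample(length(lines), nl)]        # random subset of lines
    # crude profanity filter: drop any sampled line containing a bad word
    keep[!grepl(paste(bad_words, collapse = "|"), keep, ignore.case = TRUE)]
}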

source("02_clean_sample.R")
options(warn=-1)    # temporarily silence warnings while sampling
df_data_info <- gen_sample(seed=150,nl=10)
options(warn=0)     # restore default warning behaviour
knitr::kable(df_data_info,caption="Data File Info")
Data File Info

                   Blog File    News File   Twitter File
----------------  ----------  -----------  -------------
Number of Lines       899288      1010242         899288
Number of Words     37334117     34365936       37334117
Number of Bytes    210160014    205811889      210160014

Exploratory Analysis

Further Cleaning and “ngram” Frequency Analysis

The sample files created previously have been loaded and a corpus created using the “tm” R package. The “tm” package is then used to further clean the data (a sketch of this pipeline follows the list below). Specifically:

  • Removing Punctuation
  • Removing Numbers
  • Converting to Lower Case
  • Removing Web Links
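The cleaning pipeline could be sketched as follows. The “samples” directory name and the removeURL helper are assumptions for illustration; the standard transformations come from the “tm” package.

library(tm)

# helper to strip web links (assumed pattern; not part of "tm" itself)
removeURL <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", "", x))

corpus <- VCorpus(DirSource("samples", encoding = "UTF-8"))  # sampled files
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeURL)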

Next a “Term Document Matrix” was created and the number of unique unigrams, bigrams, trigrams and quadgrams is reported, along with the sparsity of the term document matrix and other information.
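Such n-gram term document matrices can be built with a custom tokenizer, for example along the lines of the bigram sketch below (the tokenizer uses the “NLP” package that “tm” depends on; trigrams and quadgrams follow the same pattern with n = 3 and n = 4).

library(tm)

# bigram tokenizer based on NLP::ngrams; n = 3 / 4 gives tri-/quadgrams
BigramTokenizer <- function(x) {
    unlist(lapply(NLP::ngrams(NLP::words(x), 2), paste, collapse = " "),
           use.names = FALSE)
}

tdm_bigram <- TermDocumentMatrix(corpus,
                                 control = list(tokenize = BigramTokenizer))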

stemming <- FALSE   # first run: frequency analysis without stemming
source("03a_prop_function.R",print.eval=TRUE)
source("03_corpus_exploratory.R",print.eval=TRUE)
kable(df_freq_info, caption="Frequency Analysis")
Frequency Analysis

            Number Unique   % required for 50% coverage   % required for 90% coverage   Non-blank TDM entries   Blank TDM entries
----------  -------------  ----------------------------  ----------------------------  ----------------------  ------------------
Unigrams             8856                      3.376242                      60.61427                   29755            13254245
Bigrams             31591                     32.601057                      86.52148                   41583            47344917
Trigrams            39657                     48.185692                      89.63865                   40885            59444615
Quadgrams           39444                     49.736335                      89.94777                   39580            59126420

The number of unique terms, the sparsity and the percentage of ngrams required to cover 50% and 90% of the ngram corpus all increase as we go from unigrams to bigrams and beyond, which seems intuitive. Plotting the frequency distribution gives further insights.

The most frequent terms are as you would expect, e.g. “the” for unigrams and “of the” for bigrams. It is also clear that the shape of the distribution changes for different ngrams. Specifically, a large proportion of the unigrams are clustered around the most frequent words, whereas the distributions become much more dispersed for the bigrams, trigrams and quadgrams. Additionally, the frequencies for the unigrams seem to follow “Zipf’s law” (https://en.wikipedia.org/wiki/Zipf's_law), which is reassuring, although the bar for “that” is clearly less than a third of the height of the bar for “the”.
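For reference, the coverage percentages reported in the table above can be computed from a vector of ngram counts roughly as in the sketch below (freqs here is a hypothetical named vector of counts, not an object from my scripts):

# percentage of unique ngrams needed to cover a given fraction of the corpus
coverage_percent <- function(freqs, coverage = 0.5) {
    freqs    <- sort(freqs, decreasing = TRUE)               # most frequent first
    n_needed <- which(cumsum(freqs) / sum(freqs) >= coverage)[1]
    100 * n_needed / length(freqs)
}

For example, coverage_percent(unigram_freqs, 0.9) would give the 90% coverage column for the unigrams.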

Words from Foreign Languages

Using the american-english dictionary provided with Linux distributions, the following script

source("03b_nonenglish.R")

produces a list of non-English terms (nonenglish.txt in my repo) used in the corpus. Most of them are slang terms, although the French word “voila” does appear. I’m not sure it is a good idea to try to remove all non-English words, as there are few of them and some, such as “voila”, could well be used fairly frequently in conversational English.
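The idea behind that check could look something like the following sketch, assuming the usual Debian/Ubuntu word list location and a character vector corpus_terms holding the unique unigrams (both assumptions for illustration):

# compare corpus unigrams against the system dictionary (assumed location)
dict        <- tolower(readLines("/usr/share/dict/american-english"))
non_english <- setdiff(tolower(corpus_terms), dict)   # terms not in the dictionary
writeLines(non_english, "nonenglish.txt")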

Increasing Coverage

Stemming is a possible way of increasing coverage, i.e. using fewer words to cover a greater percentage of the corpus. Stemming involves reducing words to their root, so jumping/jumped/jumps could all be represented by jump. The R “tm” package allows use of Porter stemming (http://snowball.tartarus.org/algorithms/porter/stemmer.html). The output after the sketch below is a rerun of the frequency analysis above with Porter stemming incorporated.
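In the corpus pipeline, stemming can be switched on with a single extra transformation, roughly as sketched here (stemDocument() comes from “tm” and uses the Porter stemmer from “SnowballC”; the stemming flag mirrors the one set in the chunks above and below):

library(tm)
library(SnowballC)   # provides the Porter stemmer used by stemDocument()

# apply Porter stemming only when the stemming flag is set
if (stemming) {
    corpus <- tm_map(corpus, stemDocument, language = "english")
}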

stemming <- TRUE    # rerun the frequency analysis with Porter stemming
source("03a_prop_function.R",print.eval=TRUE)
source("03_corpus_exploratory.R",print.eval=TRUE)
kable(df_freq_info, caption="Frequency Analysis")
Frequency Analysis

            Number Unique   % required for 50% coverage   % required for 90% coverage   Non-blank TDM entries   Blank TDM entries
----------  -------------  ----------------------------  ----------------------------  ----------------------  ------------------
Unigrams             6673                      3.566612                      49.91758                   28216             9981284
Bigrams             30346                     29.875437                      85.97509                   41478            45477522
Trigrams            39473                     47.972031                      89.59542                   40845            59168655
Quadgrams           39406                     49.718317                      89.94316                   39552            59069448

As would be expected, the number of unique unigrams drops considerably, as does the percentage of unigrams required to cover 90% of the unigram corpus. However, the effect of stemming on the bigram and trigram frequencies is considerably less pronounced.

Plans for App/Model

The next steps will be to produce some preliminary models. The basis of these models will be the calculation of conditional probabilities for the next word given the previous word(s); the words with the highest conditional probability will be used as predictions (some initial models can be seen in the files 04a_bigram_cp_model.R, 04b_trigram_cp_model.R, 04c_quadgram_cp_model.R, 04a_quingram_cp_model.R). The models will need to be tested for speed and accuracy, and then the “best” model will be chosen and used to produce a text prediction app using Shiny.
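As a minimal illustration of the conditional probability idea (a sketch only, not the contents of 04a_bigram_cp_model.R), a bigram predictor could look roughly like this, assuming hypothetical named count vectors bigram_counts (with names such as “of the”) and unigram_counts:

# estimate P(next word | previous word) from counts and return the top n candidates
predict_next <- function(w1, bigram_counts, unigram_counts, n = 3) {
    candidates <- bigram_counts[startsWith(names(bigram_counts), paste0(w1, " "))]
    cond_prob  <- candidates / unigram_counts[w1]            # P(w2 | w1)
    top        <- head(sort(cond_prob, decreasing = TRUE), n)
    sub(paste0("^", w1, " "), "", names(top))                # return the predicted words
}

The trigram and quadgram models would follow the same pattern, conditioning on the previous two or three words instead of one.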