This project aims to use text data collected from Twitter, blogs and news sites to produce a predictive text model. So far I have downloaded and extracted the data, sampled and cleaned the data, and performed exploratory analysis. I have written R scripts for each of these processes, which are available in my GitHub repository: https://github.com/Smeths/CapStoneProject
The following script downloads the data from “https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip” if it is not already present in the working directory, then extracts it. The script also downloads a “bad words” list created by Google, which will be used to filter out profanity during the cleaning process.
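The script itself is not reproduced here; a minimal sketch of the logic it is assumed to contain (the bad words URL below is only a placeholder) might look like this:

```r
# Sketch (assumed logic, not a copy of 01_download_extract.R)
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(data_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
if (!dir.exists("final")) {
  unzip("Coursera-SwiftKey.zip")   # extracts into the final/ directory
}
# bad words list (placeholder URL, not the real location)
if (!file.exists("bad_words.txt")) {
  download.file("https://example.com/google_bad_words.txt", "bad_words.txt")
}
```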
source("01_download_extract.R")
knitr::kable(df_data_files,caption="Data Files")
| Blog File | News File | Twitter File |
|---|---|---|
| en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
The next script contains a function that produces random samples from each of the data files, with a specified sample size and random seed. The samples are cleaned by removing profanity (using the bad words list discussed above), and the function returns a data frame containing basic information, such as the number of lines, for each data file.
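The function itself lives in 02_clean_sample.R; a rough sketch of the general approach (line-level binomial sampling plus a simple profanity filter, with argument names that are illustrative rather than those actually used by gen_sample) is:

```r
# Sketch of a sampling/cleaning routine (assumed, not the project function)
sample_file <- function(infile, outfile, seed = 150, prop = 0.01, bad_words) {
  set.seed(seed)
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  keep  <- as.logical(rbinom(length(lines), size = 1, prob = prop))
  sampled <- lines[keep]

  # crude profanity filter: drop any sampled line containing a bad word
  pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
  sampled <- sampled[!grepl(pattern, sampled, ignore.case = TRUE)]
  writeLines(sampled, outfile)

  # basic summary information for the original file
  data.frame(lines = length(lines),
             words = sum(lengths(strsplit(lines, "\\s+"))),
             bytes = file.size(infile))
}
```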
source("02_clean_sample.R")
options(warn=-1)                           # temporarily suppress warnings while sampling
df_data_info <- gen_sample(seed=150,nl=10)
options(warn=0)                            # restore the default warning behaviour
knitr::kable(df_data_info,caption="Data File Info")
| | Blog File | News File | Twitter File |
|---|---|---|---|
| Number of Lines | 899288 | 1010242 | 899288 |
| Number of Words | 37334117 | 34365936 | 37334117 |
| Number of Bytes | 210160014 | 205811889 | 210160014 |
The sample files created previously are then loaded and a corpus is created using the “tm” R package. The “tm” package is then used to further clean the data.
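The exact transformations are defined in the sourced scripts; a minimal sketch of a typical “tm” cleaning pipeline (an assumption about what the cleaning includes, with an assumed samples/ directory and bad_words.txt file) is:

```r
library(tm)

# Build a corpus from the sample files and apply standard cleaning steps
corpus <- VCorpus(DirSource("samples/", encoding = "UTF-8"))   # path assumed
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# remove profanity using the bad words list downloaded earlier (file name assumed)
bad_words <- readLines("bad_words.txt")
corpus <- tm_map(corpus, removeWords, bad_words)
```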
Next a “Term Document Matrix” was created, and the number of unique unigrams, bigrams, trigrams and quadgrams is reported, along with the sparsity of the term document matrix and other information.
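The term document matrices themselves are built in the scripts sourced below; one common way to build them for n-grams, shown here only as a sketch (the project scripts may use a different tokenizer), is with RWeka:

```r
library(tm)
library(RWeka)

# Tokenizer producing bigrams; change min/max for trigrams, quadgrams, etc.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm_bigram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm_bigram   # printing shows the number of terms and non-/sparse entries
```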
stemming <- FALSE   # build the term document matrices without stemming
source("03a_prop_function.R",print.eval=TRUE)
source("03_corpus_exploratory.R",print.eval=TRUE)
kable(df_freq_info, caption="Frequency Analysis")
| | Number Unique | Percent Required for 50% Coverage | Percent Required for 90% Coverage | Non-blank TDM Entries | Blank TDM Entries |
|---|---|---|---|---|---|
| Unigrams | 8856 | 3.376242 | 60.61427 | 29755 | 13254245 |
| Bigrams | 31591 | 32.601057 | 86.52148 | 41583 | 47344917 |
| Trigrams | 39657 | 48.185692 | 89.63865 | 40885 | 59444615 |
| Quadgrams | 39444 | 49.736335 | 89.94777 | 39580 | 59126420 |
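For reference, the coverage percentages in the table can be reproduced from a term frequency vector along the following lines (a sketch only; freq is an assumed named vector of n-gram counts, e.g. from slam::row_sums applied to a term document matrix):

```r
library(slam)   # row_sums works directly on the sparse term document matrix

freq <- sort(row_sums(tdm_bigram), decreasing = TRUE)   # counts per term

# percentage of unique terms needed to cover a given proportion of all tokens
coverage <- function(freq, target = 0.5) {
  cum_prop <- cumsum(freq) / sum(freq)
  100 * which(cum_prop >= target)[1] / length(freq)
}

coverage(freq, 0.5)   # percent of terms required for 50% coverage
coverage(freq, 0.9)   # percent of terms required for 90% coverage
```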
The number of unique terms, the sparsity, and the percentage of n-grams required to cover 50% and 90% of the n-gram corpus all increase as we move from unigrams to bigrams and beyond, which seems intuitive. Plotting the frequency distributions gives further insight.
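The plots themselves are produced by the exploratory script and are not reproduced here; a sketch of the kind of bar chart involved (assuming the sorted frequency vector freq from the sketch above) is:

```r
library(ggplot2)

# bar chart of the 20 most frequent terms
top_terms <- data.frame(term = names(freq)[1:20], count = as.numeric(freq[1:20]))
ggplot(top_terms, aes(x = reorder(term, -count), y = count)) +
  geom_col() +
  labs(x = "term", y = "frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```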
The most frequent terms are as you would expect, e.g. “the” for unigrams and “of the” for bigrams. It is also clear that the shape of the distribution changes for different n-grams. Specifically, a large proportion of the unigram counts are concentrated in the most frequent words, whereas the distributions become much more dispersed for the bigrams, trigrams and quadgrams. Additionally, the unigram frequencies seem to follow Zipf's law (https://en.wikipedia.org/wiki/Zipf's_law), which is reassuring, although the bar for “that” is clearly less than a third of the height of the bar for “the”.
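Zipf's law predicts that frequency falls off roughly in inverse proportion to rank, which can be checked with a log-log rank-frequency plot (again only a sketch, using the assumed freq vector; the “tm” package also provides Zipf_plot() for a similar check):

```r
# Zipf's law predicts an approximately straight line on log-log axes
ranks <- seq_along(freq)
plot(log10(ranks), log10(as.numeric(freq)),
     xlab = "log10(rank)", ylab = "log10(frequency)",
     main = "Rank-frequency plot")
```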
Using the american-english dictionary provided with many Linux distributions, the following script produces a list of non-English terms used in the corpus (nonenglish.txt in my repo).
source("03b_nonenglish.R")
Most of the non-English terms are slang, although the French word “voila” does appear. I'm not sure it is a good idea to try to remove all non-English words, as there are few of them and some, such as “voila”, could well be used fairly frequently in conversational English.
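A sketch of how such a check might work (the dictionary path is the standard Debian/Ubuntu location, and tdm_unigram is an assumed unigram term document matrix):

```r
library(tm)

# compare corpus terms against the system dictionary
dictionary  <- tolower(readLines("/usr/share/dict/american-english"))
terms       <- tolower(Terms(tdm_unigram))
non_english <- setdiff(terms, dictionary)
writeLines(non_english, "nonenglish.txt")
```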
Stemming is a possible way of increasing coverage, i.e. using fewer words to cover a greater percentage of the corpus. Stemming involves reducing words to their root, so jumping/jumped/jumps could all be represented by jump. The R “tm” package allows the use of Porter stemming (http://snowball.tartarus.org/algorithms/porter/stemmer.html). The output below is a rerun of the frequency analysis above with Porter stemming incorporated.
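In “tm”, Porter stemming is applied with stemDocument(), which uses the SnowballC package; a minimal sketch:

```r
library(tm)
library(SnowballC)   # supplies the Porter stemmer used by stemDocument()

corpus_stemmed <- tm_map(corpus, stemDocument)

# e.g. "jumping", "jumped" and "jumps" are all reduced to "jump"
stemDocument(c("jumping", "jumped", "jumps"))
```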
stemming <- TRUE   # rerun the frequency analysis with Porter stemming
source("03a_prop_function.R",print.eval=TRUE)
source("03_corpus_exploratory.R",print.eval=TRUE)
kable(df_freq_info, caption="Frequency Analysis")
| | Number Unique | Percent Required for 50% Coverage | Percent Required for 90% Coverage | Non-blank TDM Entries | Blank TDM Entries |
|---|---|---|---|---|---|
| Unigrams | 6673 | 3.566612 | 49.91758 | 28216 | 9981284 |
| Bigrams | 30346 | 29.875437 | 85.97509 | 41478 | 45477522 |
| Trigrams | 39473 | 47.972031 | 89.59542 | 40845 | 59168655 |
| Quadgrams | 39406 | 49.718317 | 89.94316 | 39552 | 59069448 |
As would be expected, the number of unique unigrams drops considerably, as does the percentage of unigrams required to cover 90% of the unigram corpus. However, the effect of stemming on the bigram and trigram frequencies is considerably less pronounced.
The next steps will be to produce some preliminary models. The basis of these models will be the calculation of conditional probabilities for the next word given the previous word(s); the words with the highest conditional probability will be used as predictions (some initial models can be seen in the files 04a_bigram_cp_model.R, 04b_trigram_cp_model.R, 04c_quadgram_cp_model.R and 04a_quingram_cp_model.R). The models will need to be tested for speed and accuracy, and then the “best” model will be chosen and used to produce a text prediction app using Shiny.
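As an illustration of the idea only (not the code in the model files listed above), the bigram conditional probabilities can be estimated from counts as P(w2 | w1) = count(w1 w2) / count(w1); a minimal sketch, assuming data frames of unigram and bigram counts with the column names shown, is:

```r
library(dplyr)

# assumed inputs:
#   unigrams: data frame with columns word, count
#   bigrams:  data frame with columns word1, word2, count
bigram_probs <- bigrams %>%
  left_join(unigrams, by = c("word1" = "word"),
            suffix = c("_bigram", "_unigram")) %>%
  mutate(prob = count_bigram / count_unigram)

# predict the next word: highest conditional probability given the previous word
predict_next <- function(prev_word, n = 3) {
  bigram_probs %>%
    filter(word1 == prev_word) %>%
    arrange(desc(prob)) %>%
    head(n) %>%
    pull(word2)
}
```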