This project aims to use text data collected from Twitter, blogs and news sites to produce a predictive text model. So far I have downloaded and extracted the data, sampled and cleaned the data, and performed exploratory analysis. I have written R scripts for each of these processes, which are available in my GitHub repository: https://github.com/Smeths/CapStoneProject
The following script downloads the data from “https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip” if it is not already present in the working directory, then extracts it. The script also downloads a “bad words” list created by Google, which will be used to filter out profanity during the cleaning process.
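The script itself is not reproduced here; a minimal sketch of the logic it is assumed to contain (the bad words URL below is only a placeholder) might look like this:

```r
# Sketch (assumed logic, not a copy of 01_download_extract.R)
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(data_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
if (!dir.exists("final")) {
  unzip("Coursera-SwiftKey.zip")   # extracts into the final/ directory
}
# bad words list (placeholder URL, not the real location)
if (!file.exists("bad_words.txt")) {
  download.file("https://example.com/google_bad_words.txt", "bad_words.txt")
}
```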
source("01_download_extract.R")
knitr::kable(df_data_files,caption="Data Files")
| Blog File | News File | Twitter File |
|---|---|---|
| en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
The next script contains a function that produces random samples from each of the data files, with a specified sample size and random seed. The samples are cleaned by removing profanity (using the bad words list discussed above), and the function returns a data frame containing basic information, such as the number of lines, for each data file.
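The function itself lives in 02_clean_sample.R; a rough sketch of the general approach (line-level binomial sampling plus a simple profanity filter, with argument names that are illustrative rather than those actually used by gen_sample) is:

```r
# Sketch of a sampling/cleaning routine (assumed, not the project function)
sample_file <- function(infile, outfile, seed = 150, prop = 0.01, bad_words) {
  set.seed(seed)
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  keep  <- as.logical(rbinom(length(lines), size = 1, prob = prop))
  sampled <- lines[keep]

  # crude profanity filter: drop any sampled line containing a bad word
  pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
  sampled <- sampled[!grepl(pattern, sampled, ignore.case = TRUE)]
  writeLines(sampled, outfile)

  # basic summary information for the original file
  data.frame(lines = length(lines),
             words = sum(lengths(strsplit(lines, "\\s+"))),
             bytes = file.size(infile))
}
```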
source("02_clean_sample.R")
options(warn=-1)                           # temporarily suppress warnings while sampling
df_data_info <- gen_sample(seed=150,nl=10)
options(warn=0)                            # restore the default warning behaviour
knitr::kable(df_data_info,caption="Data File Info")
| | Blog File | News File | Twitter File |
|---|---|---|---|
| Number of Lines | 899288 | 1010242 | 899288 |
| Number of Words | 37334117 | 34365936 | 37334117 |
| Number of Bytes | 210160014 | 205811889 | 210160014 |
The sample files created previously are then loaded and a corpus is created using the “tm” R package. The “tm” package is then used to further clean the data.
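The exact transformations are defined in the sourced scripts; a minimal sketch of a typical “tm” cleaning pipeline (an assumption about what the cleaning includes, with an assumed samples/ directory and bad_words.txt file) is:

```r
library(tm)

# Build a corpus from the sample files and apply standard cleaning steps
corpus <- VCorpus(DirSource("samples/", encoding = "UTF-8"))   # path assumed
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# remove profanity using the bad words list downloaded earlier (file name assumed)
bad_words <- readLines("bad_words.txt")
corpus <- tm_map(corpus, removeWords, bad_words)
```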
Next a “Term Document Matrix” was created, and the number of unique unigrams, bigrams, trigrams and quadgrams is reported, along with the sparsity of the term document matrix and other information.
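The term document matrices themselves are built in the scripts sourced below; one common way to build them for n-grams, shown here only as a sketch (the project scripts may use a different tokenizer), is with RWeka:

```r
library(tm)
library(RWeka)

# Tokenizer producing bigrams; change min/max for trigrams, quadgrams, etc.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm_bigram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm_bigram   # printing shows the number of terms and non-/sparse entries
```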
stemming <- FALSE   # build the term document matrices without stemming
source("03a_prop_function.R",print.eval=TRUE)
source("03_corpus_exploratory.R",print.eval=TRUE)
kable(df_freq_info, caption="Frequency Analysis")
| | Number Unique | Percent Required for 50% Coverage | Percent Required for 90% Coverage | Non-blank TDM Entries | Blank TDM Entries |
|---|---|---|---|---|---|
| Unigrams | 8856 | 3.376242 | 60.61427 | 29755 | 13254245 |
| Bigrams | 31591 | 32.601057 | 86.52148 | 41583 | 47344917 |
| Trigrams | 39657 | 48.185692 | 89.63865 | 40885 | 59444615 |
| Quadgrams | 39444 | 49.736335 | 89.94777 | 39580 | 59126420 |
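For reference, the coverage percentages in the table can be reproduced from a term frequency vector along the following lines (a sketch only; freq is an assumed named vector of n-gram counts, e.g. from slam::row_sums applied to a term document matrix):

```r
library(slam)   # row_sums works directly on the sparse term document matrix

freq <- sort(row_sums(tdm_bigram), decreasing = TRUE)   # counts per term

# percentage of unique terms needed to cover a given proportion of all tokens
coverage <- function(freq, target = 0.5) {
  cum_prop <- cumsum(freq) / sum(freq)
  100 * which(cum_prop >= target)[1] / length(freq)
}

coverage(freq, 0.5)   # percent of terms required for 50% coverage
coverage(freq, 0.9)   # percent of terms required for 90% coverage
```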
The number of unique terms, the sparsity, and the percentage of n-grams required to cover 50% and 90% of the n-gram corpus all increase as we move from unigrams to bigrams and beyond, which seems intuitive. Plotting the frequency distributions gives further insight.
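The plots themselves are produced by the exploratory script and are not reproduced here; a sketch of the kind of bar chart involved (assuming the sorted frequency vector freq from the sketch above) is:

```r
library(ggplot2)

# bar chart of the 20 most frequent terms
top_terms <- data.frame(term = names(freq)[1:20], count = as.numeric(freq[1:20]))
ggplot(top_terms, aes(x = reorder(term, -count), y = count)) +
  geom_col() +
  labs(x = "term", y = "frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```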
The most frequent terms are as you would expect, e.g. “the” for unigrams and “of the” for bigrams. It is also clear that the shape of the distribution changes for different n-grams. Specifically, a large proportion of the unigram counts are concentrated in the most frequent words, whereas the distributions become much more dispersed for the bigrams, trigrams and quadgrams. Additionally, the unigram frequencies seem to follow Zipf's law (https://en.wikipedia.org/wiki/Zipf's_law), which is reassuring, although the bar for “that” is clearly less than a third of the height of the bar for “the”.
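Zipf's law predicts that frequency falls off roughly in inverse proportion to rank, which can be checked with a log-log rank-frequency plot (again only a sketch, using the assumed freq vector; the “tm” package also provides Zipf_plot() for a similar check):

```r
# Zipf's law predicts an approximately straight line on log-log axes
ranks <- seq_along(freq)
plot(log10(ranks), log10(as.numeric(freq)),
     xlab = "log10(rank)", ylab = "log10(frequency)",
     main = "Rank-frequency plot")
```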
Using the american-english dictionary provided with many Linux distributions, the following script produces a list of non-English terms used in the corpus (nonenglish.txt in my repo).
source("03b_nonenglish.R")
Most of the non-English terms are slang, although the French word “voila” does appear. I'm not sure it is a good idea to try to remove all non-English words, as there are few of them and some, such as “voila”, could well be used fairly frequently in conversational English.
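A sketch of how such a check might work (the dictionary path is the standard Debian/Ubuntu location, and tdm_unigram is an assumed unigram term document matrix):

```r
library(tm)

# compare corpus terms against the system dictionary
dictionary  <- tolower(readLines("/usr/share/dict/american-english"))
terms       <- tolower(Terms(tdm_unigram))
non_english <- setdiff(terms, dictionary)
writeLines(non_english, "nonenglish.txt")
```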
Stemming is a possible way of increasing coverage, i.e. using fewer words to cover a greater percentage of the corpus. Stemming involves reducing words to their root, so jumping/jumped/jumps could all be represented by jump. The R “tm” package allows the use of Porter stemming (http://snowball.tartarus.org/algorithms/porter/stemmer.html). The output below is a rerun of the frequency analysis above with Porter stemming incorporated.
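In “tm”, Porter stemming is applied with stemDocument(), which uses the SnowballC package; a minimal sketch:

```r
library(tm)
library(SnowballC)   # supplies the Porter stemmer used by stemDocument()

corpus_stemmed <- tm_map(corpus, stemDocument)

# e.g. "jumping", "jumped" and "jumps" are all reduced to "jump"
stemDocument(c("jumping", "jumped", "jumps"))
```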
stemming <- TRUE   # rerun the frequency analysis with Porter stemming
source("03a_prop_function.R",print.eval=TRUE)
source("03_corpus_exploratory.R",print.eval=TRUE)
kable(df_freq_info, caption="Frequency Analysis")
| | Number Unique | Percent Required for 50% Coverage | Percent Required for 90% Coverage | Non-blank TDM Entries | Blank TDM Entries |
|---|---|---|---|---|---|
| Unigrams | 6673 | 3.566612 | 49.91758 | 28216 | 9981284 |
| Bigrams | 30346 | 29.875437 | 85.97509 | 41478 | 45477522 |
| Trigrams | 39473 | 47.972031 | 89.59542 | 40845 | 59168655 |
| Quadgrams | 39406 | 49.718317 | 89.94316 | 39552 | 59069448 |
As would be expected, the number of unique unigrams drops considerably, as does the percentage of unigrams required to cover 90% of the unigram corpus. However, the effect of stemming on the bigram and trigram frequencies is considerably less pronounced.
The next steps will be to produce some preliminary models. The basis of these models will be the calculation of conditional probabilities for the next word given the previous word(s); the words with the highest conditional probability will be used as predictions (some initial models can be seen in the files 04a_bigram_cp_model.R, 04b_trigram_cp_model.R, 04c_quadgram_cp_model.R and 04a_quingram_cp_model.R). The models will need to be tested for speed and accuracy, and then the “best” model will be chosen and used to produce a text prediction app using Shiny.
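As an illustration of the idea only (not the code in the model files listed above), the bigram conditional probabilities can be estimated from counts as P(w2 | w1) = count(w1 w2) / count(w1); a minimal sketch, assuming data frames of unigram and bigram counts with the column names shown, is:

```r
library(dplyr)

# assumed inputs:
#   unigrams: data frame with columns word, count
#   bigrams:  data frame with columns word1, word2, count
bigram_probs <- bigrams %>%
  left_join(unigrams, by = c("word1" = "word"),
            suffix = c("_bigram", "_unigram")) %>%
  mutate(prob = count_bigram / count_unigram)

# predict the next word: highest conditional probability given the previous word
predict_next <- function(prev_word, n = 3) {
  bigram_probs %>%
    filter(word1 == prev_word) %>%
    arrange(desc(prob)) %>%
    head(n) %>%
    pull(word2)
}
```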