This report describes the approach and activities for the Data Science Specialization Capstone project. The objective of the project is to develop a text prediction application like those used on mobile phones and in apps such as WhatsApp. It covers the initial data analysis and the high-level plans for the application:
1 Loading the data
2 Capturing and reporting key statistics of the data sets
3 Taking a workable sample of the data
4 Cleaning and preprocessing the data
5 Applying a profanity filter
6 Tokenization
7 Frequency analysis and 2- and 3-grams
8 Findings & conclusions
9 Way forward
Initial exploratory analysis started with reading about NLP from different sources on the Internet and familiarizing myself with some of the R packages providing NLP functionality (NLP, openNLP, RWeka, tm, tokenizers). Next, the three input datasets were read to investigate their basic properties and format. The results of this analysis can be found in the table below. Due to the significant size of the input data, the analysis was made on a sample of 5% of each of the three input files.
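As an illustration, a minimal R sketch of this load-and-sample step could look as follows. The file paths, the seed, and the use of rbinom are assumptions on my part; only the 5% sample rate comes from the analysis itself.

```r
# Read the three input files and keep a ~5% random sample of each.
set.seed(1234)                                  # reproducible sampling
files <- c(twitter = "final/en_US/en_US.twitter.txt",
           news    = "final/en_US/en_US.news.txt",
           blogs   = "final/en_US/en_US.blogs.txt")

samples <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)  # read the full file
  lines[rbinom(length(lines), 1, 0.05) == 1]                 # keep ~5% of the lines
})

# Basic statistics per sample: number of records and longest record
sapply(samples, length)
sapply(samples, function(x) max(nchar(x)))
```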
Sample characteristics:
| statistic | Twitter | News | Blogs | unit |
|---|---|---|---|---|
| Original size | 167 | 206 | 210 | MB |
| Max. record length | 140 | 1105 | 2784 | characters |
| Records in sample | 118007 | 50512 | 44964 | records |
For cleansing and tokenization of the text, the tm package in R was used, together with regular expressions via the gsub function. To make the texts ready for natural language processing, a series of cleaning steps was applied to the data; an illustrative pipeline is sketched below.
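The individual cleaning steps are not reproduced here; the sketch below shows a typical tm pipeline under the assumption that lower-casing, removal of punctuation, numbers and non-ASCII characters, and whitespace normalisation were among them. The clean_corpus function and the samples object from the earlier sketch are illustrative names.

```r
library(tm)

# Illustrative cleaning pipeline for one sample; the exact set and order of
# transformations is an assumption based on common tm usage.
clean_corpus <- function(lines) {
  lines  <- gsub("[^[:ascii:]]", "", lines, perl = TRUE)   # drop non-ASCII characters
  corpus <- VCorpus(VectorSource(lines))
  corpus <- tm_map(corpus, content_transformer(tolower))   # lower case
  corpus <- tm_map(corpus, removePunctuation)              # strip punctuation
  corpus <- tm_map(corpus, removeNumbers)                  # strip digits
  corpus <- tm_map(corpus, stripWhitespace)                # collapse whitespace
  corpus
}

news_corpus <- clean_corpus(samples$news)
```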
For the profanity filter, a "bad word" list from the following source is used: https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
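A possible way to download the list and apply the filter, assuming the corpus object from the previous sketch and the removeWords transformation from the tm package:

```r
# Profanity filter: fetch the bad-word list and remove those terms from the corpus.
badwords_url <- paste0("https://raw.githubusercontent.com/LDNOOBW/",
                       "List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
badwords <- readLines(badwords_url, encoding = "UTF-8")

news_corpus <- tm_map(news_corpus, removeWords, badwords)  # may be slow for long word lists
```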
To tokenize the dataset, the termFreq function from the tm package was used to determine word counts and frequencies.
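A sketch of that word-count step is shown below; collapsing the cleaned sample into a single document before calling termFreq is an assumption on my part.

```r
# Collapse the cleaned corpus into one document and count term frequencies.
news_text <- paste(unlist(lapply(news_corpus, as.character)), collapse = " ")
freq <- sort(termFreq(PlainTextDocument(news_text)), decreasing = TRUE)

sum(freq)      # total words in the sample
length(freq)   # unique words in the sample
head(freq, 10) # most frequent terms
```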
More statistics after the preprocessing of the text:
| statistic | Twitter | News | Blogs | unit |
|---|---|---|---|---|
| Max. record length | 139 | 1079 | 2718 | characters |
| Lines in sample | 118007 | 50512 | 44964 | lines |
| Total words | 1125408 | 1367324 | 1447389 | words |
| Unique words | 9443 | 14871 | 14604 | words |
Now that the basic text is cleaned and tokenized, we can extract information from the text samples. First, the top 25 word counts are visualized, followed by the n-grams of our sample data.
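A minimal sketch of that visualisation, using the sorted frequency vector from the previous step (the plot title is illustrative):

```r
# Barplot of the 25 most frequent terms in the cleaned sample.
top25 <- head(freq, 25)
barplot(top25, las = 2, cex.names = 0.7,
        main = "Top 25 words (news sample)", ylab = "frequency")
```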
To be able to predict the next word, it is important to understand how words are sequenced in sentences. To visualize this, n-grams were created showing the frequency of word sequences of two and three words; a sketch of the n-gram extraction is shown after the table below.
| statistic | Twitter | News | Blogs | unit |
|---|---|---|---|---|
| Number of 2-grams | 38707 | 65264 | 69327 | occurrences |
| Number of 3-grams | 46778 | 84911 | 97590 | occurrences |
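The 2- and 3-grams could for instance be extracted with RWeka's NGramTokenizer, one of the packages listed earlier; whether RWeka was the tokenizer actually used here is an assumption, and news_text is the collapsed sample from the termFreq sketch.

```r
library(RWeka)

# Extract 2- and 3-grams from the cleaned sample text and count their frequencies.
bigrams  <- NGramTokenizer(news_text, Weka_control(min = 2, max = 2))
trigrams <- NGramTokenizer(news_text, Weka_control(min = 3, max = 3))

bigram_freq  <- sort(table(bigrams),  decreasing = TRUE)
trigram_freq <- sort(table(trigrams), decreasing = TRUE)

head(bigram_freq, 10)   # most frequent 2-grams
length(trigram_freq)    # number of distinct 3-grams
```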
Based on the exploratory analysis and initial mining of the text, the following observations were made:
The first step in the modelling is to combine the results of the three datasets and to cut the less frequent words and n-grams from the analysis results, in order to get an efficient but still effective text prediction application.
A basic model for text prediction is to convert occurrence frequencies into probabilities. New words, or parts of them, are compared against the frequency tables and the word with the highest likelihood is predicted. Higher-order n-grams take priority over lower-order n-grams: first the 3-gram prediction is taken, then the 2-gram, and finally the 1-gram prediction.
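As an illustration of this back-off idea, a hypothetical lookup function could look like the sketch below. The name predict_next and the frequency tables are the illustrative objects built in the earlier sketches; no smoothing or regex escaping is done here.

```r
# Hypothetical back-off lookup: try the 3-gram table first, fall back to
# 2-grams, and finally to the single most frequent 1-gram.
predict_next <- function(phrase, trigram_freq, bigram_freq, unigram_freq) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)

  # 3-gram step: trigrams starting with the last two words
  # (tables are sorted by frequency, so the first hit is the most frequent one)
  if (length(words) == 2) {
    hits <- grep(paste0("^", words[1], " ", words[2], " "),
                 names(trigram_freq), value = TRUE)
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))  # last word of best hit
  }

  # 2-gram step: bigrams starting with the last word
  hits <- grep(paste0("^", tail(words, 1), " "), names(bigram_freq), value = TRUE)
  if (length(hits) > 0) return(sub(".* ", "", hits[1]))

  # 1-gram fallback: the overall most frequent word
  names(unigram_freq)[1]
}

predict_next("thanks for the", trigram_freq, bigram_freq, freq)
```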
For unseen word sequences, words, or parts of words, the model should in my opinion not predict but instead add the new options to the dictionary, allowing the prediction model to learn from the user's own writing style and vocabulary.
Testing of the model can be achieved by monitoring successful predictions (matches) against failed predictions (unseen words plus wrong predictions).
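A possible way to carry out that test, assuming a separate held-out sample collapsed into test_text (an illustrative name) and the predict_next sketch above:

```r
# Evaluate the back-off model on held-out 3-grams: predict the third word
# from the first two and compute the share of successful predictions.
test_trigrams <- NGramTokenizer(test_text, Weka_control(min = 3, max = 3))
parts   <- strsplit(test_trigrams, " ")
context <- sapply(parts, function(p) paste(p[1], p[2]))
actual  <- sapply(parts, function(p) p[3])

predicted <- mapply(predict_next, context,
                    MoreArgs = list(trigram_freq = trigram_freq,
                                    bigram_freq  = bigram_freq,
                                    unigram_freq = freq))

mean(predicted == actual)   # matches / (matches + failures)
```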