This is a progress report for the Coursera Data Science Specialisation capstone project. In this project, learners are required to create a predictive text app similar to those used on mobile phones. The project is organised in partnership with Coursera, Johns Hopkins Bloomberg School of Public Health and SwiftKey.
The project involves the following steps:

1. Research into natural language processing
2. Exploratory data analysis
3. Developing the prediction model
4. Implementing the model/algorithm in a Shiny app
5. Producing a slide deck describing the app and explaining its key features.
This report covers points 1 and 2 and suggests an approach for the remaining steps.
The data are provided as three large text files of user-generated content from different sources:
```
## [1] "blogs" "news" "twitter"
```
Here is some general file information, without any pre-processing or cleanup.
| filename | file size (MB) | number of lines | avg. chars per line | max chars per line |
|---|---|---|---|---|
| en_US.blogs.txt | 200.4 | 899288 | 230 | 40833 |
| en_US.news.txt | 196.3 | 77259 | 202 | 5760 |
| en_US.twitter.txt | 159.4 | 2360148 | 69 | 140 |
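For reference, here is a minimal sketch of how summaries like these might be computed in base R; the `data/` file paths are assumptions for illustration, not the original script.

```r
# Sketch: summarise the raw files (paths are assumed).
files <- c("data/en_US.blogs.txt", "data/en_US.news.txt", "data/en_US.twitter.txt")

file_summary <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  chars <- nchar(lines)
  data.frame(filename  = basename(path),
             size_mb   = round(file.size(path) / 1024^2, 1),
             n_lines   = length(lines),
             avg_chars = round(mean(chars)),
             max_chars = max(chars))
}

do.call(rbind, lapply(files, file_summary))
```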
It turns out that these data are simply too large to work with effectively, so a random sample is taken: 2.5 × 10^4 lines, multiplied by the ratio of average characters per line, giving samples of roughly equal size from each source. Each sample is further split so that 20% can be kept aside as a validation set for estimating prediction accuracy.
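A sketch of that sampling scheme follows. The seed, the helper name and the reference average of 230 characters per line (the blogs figure) are assumptions for illustration:

```r
set.seed(1234)  # assumed seed, for reproducibility

# Sample more lines from sources with shorter lines, so that the
# samples are of roughly equal size in characters, then hold 20% back.
sample_source <- function(lines, avg_chars, ref_avg = 230, n_base = 2.5e4) {
  n <- round(n_base * ref_avg / avg_chars)    # shorter lines -> more lines sampled
  sampled <- sample(lines, min(n, length(lines)))
  in_valid <- seq_along(sampled) <= 0.2 * length(sampled)  # sampled order is random
  list(train = sampled[!in_valid], validation = sampled[in_valid])
}
```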
| filename | file size (MB) | number of lines | avg. chars per line | max chars per line |
|---|---|---|---|---|
| en_US.blogs.txt | 4.4 | 19927 | 230 | 7375 |
| en_US.news.txt | 4.5 | 22886 | 202 | 5760 |
| en_US.twitter.txt | 4.5 | 66594 | 69 | 140 |
This table shows the summaries for the training set. The average line length has been well preserved by the sampling process. The maximum line length is much smaller because the random sample contains only a subset of lines, and the single longest line is unlikely to be among them.
This training subset can now be used to create the corpus from which the prediction model will be developed.
At this point, it’s useful to determine if these three datasets have differing characteristics, as this could inform the approach for the rest of the assignment.
A visual, manual check of the data set found it to be very noisy and not easy to analyse. Preprocessing steps are therefore taken to try to reduce the noise in the data, along the lines of the sketch below:
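This is a representative cleaning pipeline using the `tm` package, not the exact one used; apart from the stemming and stop-word removal confirmed later in this report, the individual transformations are assumptions. `train_lines` is assumed to hold the sampled training text.

```r
library(tm)

# Sketch of a typical cleaning pipeline; the exact steps may differ.
corpus <- VCorpus(VectorSource(train_lines))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case everything
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("en"))  # stop-word removal (confirmed below)
corpus <- tm_map(corpus, stemDocument)                  # stemming (confirmed below)
corpus <- tm_map(corpus, stripWhitespace)
```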
The data are then tokenised into word n-grams of different sizes for further work. An n-gram is simply a sequence of n contiguous text units (e.g. words, letters, phonemes). In this case an n-gram of size one is a single word, while an n-gram of size two is a pair of words found adjacent to each other in the text.
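As a sketch, tokenisation into 1-, 2- and 3-grams might look like this with the `quanteda` package (one option among several; `train_lines` is again the assumed training text):

```r
library(quanteda)

# Sketch: tokenise into words, then build n-grams of sizes 1-3.
toks <- tokens(train_lines, what = "word",
               remove_punct = TRUE, remove_numbers = TRUE)
unigrams <- tokens_ngrams(toks, n = 1)
bigrams  <- tokens_ngrams(toks, n = 2)
trigrams <- tokens_ngrams(toks, n = 3)

# Term frequencies, e.g. the ten most common bigrams:
topfeatures(dfm(bigrams), 10)
```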
Here is a selection of exploratory plots to help visualise similarities and differences in the data.
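As an illustration of how such a plot might be generated, here is a sketch of a bar chart of the most frequent unigrams, built from the tokenisation results above (`ggplot2` is an assumption):

```r
library(ggplot2)

# Sketch: plot the top 20 unigrams by frequency.
unigram_freq <- topfeatures(dfm(unigrams), 20)
freq_df <- data.frame(term  = names(unigram_freq),
                      count = as.numeric(unigram_freq))

ggplot(freq_df, aes(x = reorder(term, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "frequency", title = "Top unigrams in the training sample")
```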
Here we see that the Twitter source has the fewest distinct terms at every n-gram size, while news has the most. The number of unique terms increases with n-gram length, as expected.

The x-axis labels were over-plotted for n-gram sizes 1 and 2, so they have been rotated; this makes it possible to see where term repetitions climb into the tens, hundreds, thousands and even tens of thousands simply from the spread.

In general, blogs appear to have the least repetition and Twitter the most. Making more precise claims in this area would require a more detailed analysis.
Finally, the word clouds above give a sense of the content in the 1-, 2- and 3-gram models. Keep in mind that the words have been stemmed and stop words removed, so they do not read as natural English. This was done to better understand the frequency and repetition of terms; the prediction model itself will require unstemmed words and stop words.
Stop words and unstemmed terms will be used in the final model; this will result in larger models.
Sampling the SwiftKey data involves a performance/accuracy trade-off. Other approaches, such as removing sparse terms, will be investigated to try to maximise the number of records used.
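For example, sparse-term removal is available in `tm`; a minimal sketch, with the 0.99 sparsity threshold as an assumption to be tuned:

```r
library(tm)

# Sketch: drop terms absent from more than 99% of documents.
dtm <- DocumentTermMatrix(corpus)             # corpus from the cleaning sketch above
dtm_dense <- removeSparseTerms(dtm, sparse = 0.99)
dim(dtm); dim(dtm_dense)                      # term counts before and after
```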
Some research will be done so that more than one algorithm can be created and compared.

Accuracy will be tested and measured using the hold-out set. Some algorithms have weights, lambdas or other tuning parameters, so a test script will be created to make repeated runs easy.
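As a starting point for that script, here is a minimal sketch of a hold-out evaluation loop; `predict_next_word()` is a hypothetical interface for whichever algorithm is under test, and `validation` is the 20% hold-out set:

```r
# Sketch of a validation harness. predict_next_word() is hypothetical.
score_model <- function(model, validation, n = 1000) {
  hits <- 0
  tested <- 0
  for (line in head(sample(validation), n)) {
    words <- strsplit(tolower(line), "\\s+")[[1]]
    if (length(words) < 2) next
    pos <- 2:length(words)
    cut <- pos[sample.int(length(pos), 1)]      # random split point
    context <- paste(words[seq_len(cut - 1)], collapse = " ")
    guess <- predict_next_word(model, context)  # hypothetical interface
    if (identical(guess, words[cut])) hits <- hits + 1
    tested <- tested + 1
  }
  hits / tested                                 # top-1 accuracy
}
```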