Our ultimate goal is to create a text prediction model similar to those we now encounter every day in phone keyboards and email composition tools. First, we need to understand the raw data set, which in this case consists of English text from blogs, news stories, and Twitter (now X) posts.
This report summarizes our initial exploratory analysis. It covers some basic statistics, as well as key findings from cleaning and exploring the text data. Finally, we discuss our proposed plan for building a prediction algorithm. Feel free to use the navigation on the left to skip between sections.
The source data is large, totaling over 550MB. To keep the analysis fast and manageable (the final model must ultimately run quickly on a smartphone), we first create a smaller, representative 2% random sample of the data. The tables below compare the full data set with the sample used for this report. We may adjust the sample size later as we balance the speed and accuracy of our model(s).
(Tables: size of the full data set vs. the 2% sample, by source.)
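To illustrate how such a sample can be drawn reproducibly, here is a minimal Python sketch that keeps roughly 2% of the lines of each source file. The file names, the sampling rate constant, and the fixed seed are assumptions for illustration, not the exact code behind the tables above.

```python
import random

SAMPLE_RATE = 0.02  # keep ~2% of lines, as described above (assumption)

def sample_file(in_path, out_path, rate=SAMPLE_RATE, seed=42):
    """Write a random ~2% sample of lines from in_path to out_path."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    with open(in_path, encoding="utf-8", errors="ignore") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if rng.random() < rate:
                dst.write(line)

# Hypothetical file names for the three sources
for name in ("blogs", "news", "twitter"):
    sample_file(f"en_US.{name}.txt", f"sample.{name}.txt")
```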
To build a model, we first need to understand the vocabulary of the text - the words it contains. Our first key finding is that a small number of words make up the majority of the text, which is an expected result.
The plot below shows this relationship visually. We need only around 1,500 unique words to cover 80% of all word instances, and about 7,300 words to cover 90%. This is a crucial insight: it confirms we can build an effective, accurate model with a relatively small dictionary, which also keeps it fast and efficient.
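As a rough illustration of how these coverage numbers can be computed, the sketch below sorts words by frequency and counts how many are needed to cover a given share of all word instances. The simple tokenizer and function name are assumptions, not the exact code used for the plot.

```python
from collections import Counter
import re

def words_needed_for_coverage(text, target=0.80):
    """Return how many of the most frequent words cover `target` of all word instances."""
    tokens = re.findall(r"[a-z']+", text.lower())  # simple tokenizer (assumption)
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for i, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return i
    return len(counts)

# e.g. words_needed_for_coverage(sample_text, 0.80) -> roughly 1,500 on our sample
```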
To find the most important “content” words, we first removed common “stop words” (like ‘the’, ‘of’, ‘it’, ‘a’, and ‘is’). This led to our second important insight: after this step, the most common unigrams, bigrams, and trigrams were meaningless tokens like “1 2” and “2 2” (visuals and tables omitted from this summary). We therefore adjusted our cleaning approach to also filter out these meaningless “words”. The word cloud below shows the most frequent meaningful words that remain. This gives us confidence that our sample is representative of the source text and a good base for building prediction models.
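The cleaning and counting steps described above can be sketched roughly as follows. The abbreviated stop-word list, regular expressions, and function names are illustrative assumptions, not the exact filters we used.

```python
from collections import Counter
import re

STOP_WORDS = {"the", "of", "it", "a", "is", "and", "to", "in"}  # abbreviated list (assumption)

def clean_tokens(text):
    """Lowercase, tokenize, and drop stop words and purely numeric 'words'."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens
            if t not in STOP_WORDS                   # remove stop words
            and not re.fullmatch(r"[0-9]+", t)]      # remove tokens like "1" or "2"

def ngram_counts(tokens, n=2):
    """Count n-grams (e.g. bigrams for n=2) as space-joined strings."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```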
Our goal is to predict the next word a user might type. For this, we will use an n-gram backoff model.
The tables below show the top bigrams found in our sample data. The first table includes stop words and reveals the text’s structure, while the second table (with stop words removed) highlights meaningful content phrases. These frequencies will be at the core of our prediction model.
(Tables: top bigrams, with and without stop words.)
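A minimal sketch of the planned backoff lookup, assuming we already have frequency tables like those above (here plain Python dictionaries with hypothetical names `trigram_counts`, `bigram_counts`, and `unigram_counts`): try the longest matching context first and fall back to shorter ones. This is an unsmoothed illustration of the idea, not the final algorithm.

```python
def predict_next_word(context, trigram_counts, bigram_counts, unigram_counts):
    """Predict the next word for a list of preceding words using simple backoff."""
    # 1) Try the last two words as a trigram context.
    if len(context) >= 2:
        key = " ".join(context[-2:])
        candidates = {ng.split()[-1]: c for ng, c in trigram_counts.items()
                      if ng.rsplit(" ", 1)[0] == key}
        if candidates:
            return max(candidates, key=candidates.get)
    # 2) Back off to the last word as a bigram context.
    if context:
        key = context[-1]
        candidates = {ng.split()[-1]: c for ng, c in bigram_counts.items()
                      if ng.split()[0] == key}
        if candidates:
            return max(candidates, key=candidates.get)
    # 3) Back off to the single most frequent word overall.
    return max(unigram_counts, key=unigram_counts.get)
```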
Finally, we found that the three data sources have different characteristics. The Twitter data is ‘noisier’: it likely contains more slang and, as shown below, more non-English words than the blogs or news data.
This is another valuable insight, as it tells us our model must be trained on a diverse sample that includes this informal text to be effective for our users. The chart below visualizes this by showing the top detected non-English languages in each source.
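One way to produce a chart like this is to run a language detector over the lines of each source and tally everything not detected as English. The sketch below uses the third-party langdetect package as an example; that choice of tooling is an assumption for illustration, not necessarily how the chart was built.

```python
from collections import Counter
from langdetect import detect  # third-party package (assumption)

def non_english_language_counts(lines):
    """Count detected languages, excluding English, for a list of text lines."""
    counts = Counter()
    for line in lines:
        try:
            lang = detect(line)
        except Exception:
            continue  # langdetect raises on very short or ambiguous text; skip those lines
        if lang != "en":
            counts[lang] += 1
    return counts
```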
With this exploratory analysis now complete, our next steps will be to: