Introduction
This report is my submission for the Johns Hopkins/Coursera Data Science Capstone, the final course in the Data Science Specialisation. It details some initial results from an exploratory data analysis of three text datasets taken from blogs, news and tweets, provided by the developers of SwiftKey (see here). These datasets will form the basis of a text prediction model that suggests the current or subsequent word as the user types.
I begin by looking at examples of the content and presenting some initial statistics about the datasets. I then move on to look at the frequencies of words and the relationships between words within the sets. Methods for removing stop words, profanity and non-English terms are also explored. I conclude by presenting some initial ideas for the text prediction model and the interactive application that will be built from it.
Loading and Checking Data for Cleanliness
I began by downloading and unzipping the full dataset, then reading in all lines from the three English corpora (blogs, news, tweets). For this analysis I will only be dealing with the English-language data from the dataset. As a starting point, I performed an initial check on the lines to see if any contain NA values or empty strings, with the following results:
| Corpus | Contains Not Available Lines | Contains Empty Lines |
|--------|------------------------------|----------------------|
| News   | FALSE                        | FALSE                |
| Blogs  | FALSE                        | FALSE                |
| Tweets | FALSE                        | FALSE                |
As we can see, all three sets appear to be relatively clean.
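For reference, the check above can be performed along the following lines; this is a minimal sketch in which the file paths are illustrative and assume the zip has been extracted to a local `data/` directory:

```r
library(readr)

# Illustrative file paths -- adjust to wherever the unzipped corpora live
files <- c(Blogs  = "data/en_US/en_US.blogs.txt",
           News   = "data/en_US/en_US.news.txt",
           Tweets = "data/en_US/en_US.twitter.txt")

corpora <- lapply(files, read_lines)

# Does any line in each corpus contain an NA value or an empty string?
data.frame(
  Corpus               = names(corpora),
  Contains_NA_Lines    = sapply(corpora, function(x) any(is.na(x))),
  Contains_Empty_Lines = sapply(corpora, function(x) any(x == "", na.rm = TRUE))
)
```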
Comparison of the Three Datasets
Content
I now move on to take a random sample of five entries from each set, in order to get a sense of how the entries differ between blogs, news and tweets. First let’s look at the blogs dataset:
| Line No | Text |
|---------|------|
| 1 | The bruschetta however, missed the mark. Instead of manageable two-bite crostini, these were huge slices of grilled bread and heaped with toppings of tomato, cannellini beans and roasted peppers with goat cheese. |
| 2 | Walden Pond, Mt. Rainier, Big Sur, Everglades and so forth; |
| 3 | Despite laws banning cell phones while driving and increased awareness of the dangers of doing so, it’s a common fact that cell phone use while driving is still a widespread occurrence. Perhaps most discouraging to the issue is that much of this distracted driving occurs amongst young drivers, which is not only a safety concern, but also might indicate that the problem could be deeply rooted for future generations. |
| 4 | ghosts and goblins |
| 5 | Now I can write in specific post information for each day of the week, and “Pre-Plan” things out a bit! LOVE THAT! Love that it’s all in one place! Love that I finally got another little area of my life organized! Love that things are going to get easier for me now that I got my act together! |
As would be expected, the blogs dataset appears to contain a wide range of content, from very short to relatively long lines, written in both formal and informal registers. Next, the news dataset:
| Line No | Text |
|---------|------|
| 1 | Of course, Paul was 20 as a rookie and had played two full seasons at Wake Forrest. Irving’s college career consisted of 11 games at Duke. |
| 2 | William Nelson, 60, of Glendale worked for more than two decades for the city of Chicago, holding positions for the municipality’s streets and sanitation, solid-waste and public-works departments. After retiring in the early 1990s, Nelson and his wife, June, moved to Glendale and had been enjoying a modest retirement, occasionally visiting friends in California. |
| 3 | A four-star lineman, the 6-foot-4, 250-pound son of Greyhounds coach Biff Poggi earned first-team All-Metro honors last fall after making 49 tackles, 11 for a loss, and finishing the season with 10 sacks. |
| 4 | The standard 25,000-mile domestic frequent-flier ticket, an emblem of airline-loyalty programs for nearly two decades and still a selling point airlines advertise heavily, seems to be going the way of the in-flight meal. |
| 5 | Mary Beth Ohlms, camp director, said she, Cheryl Houston and Jaimette McCulley, all faculty members of the Human and Environmental Services department, orchestrated its plan for a new summer camp for students in grades four to six when the department facility underwent a makeover. |
This set is the most linguistically ‘correct’ or well-written of the three, with more formal content and fewer spelling mistakes. It also seems to contain generally longer sentences. Now to the tweets:
| Line No | Text |
|---------|------|
| 1 | just wanted to thank you & ask what got you started on your mission? |
| 2 | Right when I thought I was done… I ran of “sugar” for the last dessert |
| 3 | I tell ion gaf so why test my tolerance? |
| 4 | mayfly? Wish I was there. :) |
| 5 | follow me tho, so I can dm |
In contrast to the news entries, the tweets are on average shorter (max 140 characters), frequently contain abbreviations, emoji, misspellings and slang, and are generally colloquial in nature. I also noticed some entries in Japanese within the corpus. The process of tokenisation and language detection described later in this report should serve to remove this content.
Lines and Line-lengths
To see if some of the above observations are borne out by the data, I performed a comparison of the three datasets in terms of number of lines and line lengths (min, max and mean). I also counted the total number of words and the number of unique words within each corpus. This involved splitting the lines into individual meaningful units of text (called tokens) for further analysis. In R this can be done using the unnest_tokens function from the tidytext library, which generates one token per table row. In addition, punctuation is stripped and words are converted to lowercase to ensure a like-for-like comparison (for more info on tidytext see here). A minimal sketch of this step is shown below, followed by the overall results:
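The sketch covers a single corpus and assumes `blogs` is the character vector of lines read in earlier; column names are illustrative:

```r
library(dplyr)
library(tidytext)

# One row per line of raw text, plus the line length in characters
blogs_df <- tibble(line = seq_along(blogs),
                   text = blogs,
                   line_length = nchar(blogs))

# unnest_tokens() yields one lowercase, punctuation-stripped word per row
blog_tokens <- blogs_df %>%
  unnest_tokens(word, text)

# Line statistics for the table below
blogs_df %>%
  summarise(no_of_lines = n(),
            min_line_length = min(line_length),
            max_line_length = max(line_length),
            mean_line_length = mean(line_length))

# Word counts for the table below
blog_tokens %>%
  summarise(total_no_words = n(),
            unique_no_words = n_distinct(word))
```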
| Corpus | No Of Lines | Min Line Length (chars) | Max Line Length (chars) | Mean Line Length (chars) | Total No Of Words | No Of Unique Words |
|--------|-------------|-------------------------|-------------------------|--------------------------|-------------------|--------------------|
| Blogs  | 899288      | 1                       | 40833                   | 229.98695                | 37546250          | 320008             |
| News   | 1010242     | 1                       | 11384                   | 201.16285                | 34762395          | 284533             |
| Tweets | 2360148     | 2                       | 140                     | 68.68045                 | 30093372          | 370388             |
From the above we can see that most of our initial observations are borne out. There were some surprises, such as the blogs set containing the longest lines (this was expected to be the news set), but in terms of mean line length there is not much difference between the two, and the total number of words in each is also quite similar. The news set has the lowest number of unique words, perhaps reflecting the more standardised nature of its content, while the count of unique words is highest in the tweets set, where a more individualistic use of language is expected. The total number of words in the tweets set is lower than in the other sets, and can be expected to drop further once slang terms are removed, so the end results from this set may not be as fine-grained as from the others.
Finally, the minimum line lengths of 1 and 2 characters are suspicious and probably indicate unusable content, so I chose to look at this further by pulling out a random sample of lines under five characters in length:
| Corpus | Text  | Line Length (chars) |
|--------|-------|---------------------|
| Blogs  |       | 3                   |
| Blogs  | -SK   | 3                   |
| Blogs  | 7 ago | 5                   |
| News   | PBS   | 3                   |
| News   | Play  | 4                   |
| News   | Fiat  | 4                   |
| Tweets | ill’n | 5                   |
| Tweets | Me:D  | 4                   |
| Tweets | i am  | 4                   |
Arguably, such lines do not add much value and could be removed.
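Such lines could be dropped with a simple length filter, sketched below using the `blogs_df` data frame from the earlier tokenisation sketch (the five-character cut-off matches the sample above):

```r
library(dplyr)

# Keep only lines of at least five characters
blogs_trimmed <- blogs_df %>%
  filter(line_length >= 5)
```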
Comparative Histograms of Most Commonly Used Words
In this section I present graphs of word frequencies in each of the three datasets. I have limited the scope to 10,000 randomly sampled lines from each dataset, to keep the computation manageable. This also applies to the following sections of this report.
(Figure: most common words in each corpus, before stop-word removal)
As we can see, the results are polluted by a number of common words such as “the”, “and” and “to”, which appear frequently in each of the sets and prevent us from seeing the individuality of each one. Therefore, for this part of the analysis I chose to filter out these so-called ‘stop words’ using the lexicon provided in the previously mentioned tidytext library. It is worth noting that such stop words will not be removed in the final model, as they will also need to be predicted. Here are the word-frequency graphs again with these words excluded:
(Figure: most common words in each corpus, after stop-word removal)
Now the differences between the sets are far clearer, showing that terms such as “time” and “people” feature highly in the blogs and news sets. The blogs set seems to have a more personal focus (“i’m” and “feel”), while the news set is more outward-looking, with many words relating to regularly reported news topics such as crime or sports (“police”, “game”, “team”). The tweets set also contains “time” and “people”, but these appear less frequently than familiar Twitter slang such as “rt” (for retweet) or “lol”. As an overall approach, combining these three text styles should give good prediction results.
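The sampling, stop-word removal and counting described above can be sketched roughly as follows for the blogs corpus; the seed, sample size and column names are illustrative:

```r
library(dplyr)
library(tidytext)
library(ggplot2)

set.seed(1234)

# Tokenise a 10,000-line sample and remove stop words using tidytext's lexicon
blog_sample_words <- tibble(text = sample(blogs, 10000)) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Plot the twenty most frequent remaining words
blog_sample_words %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = NULL,
       title = "Most common words in the blogs sample (stop words removed)")
```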
Profanity and Language Filtering
A major goal of the project is to remove profane and non-English words from the corpus, as it is not desirable to have these appear as suggested next words in a text prediction context. In this section I present some basic methods for classifying single profane or non-English words, alongside some statistics about the prevalence of each in the corpora. I will then apply these same methods to the relationships between words (n-grams).
Profanity Filtering
Method
To filter out profanity, I simply compare each word in each dataset to those contained in a pre-defined lexicon of profane terms. The lexicon library in R contains five dictionaries of such terms, which I have combined and filtered so as to keep only single-word terms (i.e. those that don’t contain spaces). These words are then compared against the previously segmented single-word tokens from the corpora, keeping only those that do not match.
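A rough sketch of this step, using the profanity word lists bundled with the lexicon package (dataset names may vary between package versions; `blog_tokens` is the one-word-per-row data frame from the earlier tokenisation sketch):

```r
library(dplyr)
library(lexicon)

# Combine the package's profanity dictionaries and keep single-word terms only
profane_words <- unique(c(profanity_alvarez, profanity_arr_bad, profanity_banned,
                          profanity_racist, profanity_zac_anger))
profane_words <- profane_words[!grepl("\\s", profane_words)]

# Keep only the tokens that do not match a profane term
clean_tokens <- blog_tokens %>%
  filter(!word %in% profane_words)
```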
Summary Statistics: Comparison of Profanity in the Three Datasets
| Corpus | No Of Found Words | No Of Found Profane Terms | Proportion Of Profane Words |
|--------|-------------------|---------------------------|-----------------------------|
| Blogs  | 411538            | 2919                      | 0.007                       |
| News   | 335826            | 2480                      | 0.007                       |
| Tweets | 125425            | 1544                      | 0.012                       |
From the above we can see that only a small fraction of the words in each dataset are deemed profane according to the lexicon used: roughly 0.7% for blogs and news, and 1.2% for tweets. As would be expected, the proportion of profane terms found in the tweets dataset is slightly higher than in the other two, more formal sets.
We can also look at examples of the profane terms found in the datasets, to see if they are what we would expect:
| Blogs   | News    | Tweets   |
|---------|---------|----------|
| harder  | lies    | dirty    |
| gypsies | killing | ugly     |
| hard    | thrust  | cracker  |
| nip     | oral    | backseat |
| pot     | crotch  | play     |
| willy   | hard    | hell     |
| crotch  | hard    | violence |
| playboy | blow    | lmfao    |
From this we can see that the filter is somewhat over-sensitive and regularly generates false positives (e.g. the words “pot”, “hard”, “lies”, “killing” and “ugly” all have potential profane meanings, but could all be published without issue in a standard news article). This could be improved by some kind of human classification of the results, but it is generally sufficient for our purposes. In effect we simply lose the potential predictions of a few words, which is hopefully balanced out by the volume of content and is preferable to letting potentially profane terms through. It will be interesting to see whether any profane terms escape detection when implementing the model as an application.
Language Filtering
Method
The method here is very similar to the profanity-filtering approach, except that I compare content to the GradyAugmented lexicon of English words from the qdapDictionaries library, and content is kept if it matches rather than rejected.
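A corresponding sketch, continuing from the profanity-filtered tokens above:

```r
library(dplyr)
library(qdapDictionaries)

# Keep only tokens that appear in the Grady Augmented word list
english_tokens <- clean_tokens %>%
  filter(word %in% GradyAugmented)
```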
Summary Statistics: Comparison of Language-Filtering in the Three Datasets
| Corpus | No Of Found Words | No Of Non-English Words | Proportion Of Non-English Words |
|--------|-------------------|-------------------------|---------------------------------|
| Blogs  | 411538            | 16289                   | 0.040                           |
| News   | 335826            | 17961                   | 0.053                           |
| Tweets | 125425            | 8814                    | 0.070                           |
Here we see that slightly fewer non-English words were found within the blogs dataset (around 4%) than in the news (around 5%) or tweets (around 7%) sets. It is also somewhat surprising that the proportion of non-English words found in news is nearly as high as in tweets, where you would expect fewer in the former and more in the latter. This would seem to suggest that our lexicon needs some improvement. For all corpora, well under a tenth of the total words are filtered out, so we can infer that our datasets largely contain English-language words which we can use for predictive purposes.
| Blogs        | News       | Tweets                       |
|--------------|------------|------------------------------|
| london       | paparazzi  | thingsnottosayonthefirstdate |
| you’ll       | espn       | tryna                        |
| osteoporosis | quinn’s    | hashtag                      |
| revlon       | nordbye’s  | thanx                        |
| they’d       | sundays    | p.m                          |
| people’s     | renderings | manhaten                     |
| ang          | mennemeier | haha                         |
| pps          | sherrill   | rt                           |
Here we can see that matching against the Grady Augmented dictionary does a reasonable job of removing misspelled words (“thanx”) or company names (“revlon”), but it also removes common words with apostrophes (such as “you’ll” or “they’d”) and place names (such as “london”). In previous runs I have also observed the removal of words in other scripts (“た”), but also the less useful removal of compound words such as “statehouse” or “chairlift”.
Filtering against a predefined lexicon is perhaps most problematic for the tweets dataset, where many of the most commonly used terms would not be identified as dictionary English (e.g. “hashtag”, “thanx”, “tryna”). It is an open question whether such terms should be suggested by our model, but if so then the lexicon must be expanded to include them (as far as possible). For better results against the news set in particular, it may be worth integrating the SCOWL (Spell Checker Oriented Word Lists) lexicon, more information on which can be found here.
N-Gram Analysis Based on Relationships Between Words
Analysis of single words is useful when it comes to predicting the current word a user is typing. However, when it comes to predicting the next word a user might be looking for, it is necessary to consider the relationships between words. For this, the lines from the datasets will be split into n-grams, i.e. contiguous sequences of n items. For the purposes of this report and our model, I will consider only two-word and three-word n-grams, otherwise known as bigrams and trigrams.
In the following sections I present examples of bigrams for each of the three datasets. In order to reduce computation and the size of the bigram plots on screen, I took random samples from the datasets (1,000-2,000 lines) and show only the word relationships leading off from the first words of the top twenty word pairs. The bigrams have also been filtered for profanity and non-English words using the techniques shown in the previous sections. For the real-world prediction model, however, it is possible to imagine a data structure that represents the most frequent word relationships contained within the full dataset.
Bigrams
In the case of single-word analysis, the process of tokenisation splits the original lines into a column of words; I then count the occurrences of each word in the dataset to see which words are most frequently used. In the case of two-word sequences, it is necessary to count the relationship between word pairs, i.e. the frequency of second words as they relate to the first. In this way bigrams essentially represent a table of Markov chains, where each choice of word depends only on the previous word. When filtering for profanity, digits and non-English words, I do so for both words in the pair, discarding the row if either word matches, which effectively ignores that relationship.
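A sketch of the bigram counting and filtering described above, where `sample_df` stands in for a sampled corpus with a `text` column, and `profane_words` and `GradyAugmented` are the word lists from the previous sections:

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(qdapDictionaries)

bigram_counts <- sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!is.na(word1), !is.na(word2),
         !word1 %in% profane_words, !word2 %in% profane_words,    # profanity
         !grepl("[0-9]", word1), !grepl("[0-9]", word2),           # digits
         word1 %in% GradyAugmented, word2 %in% GradyAugmented) %>% # English only
  count(word1, word2, sort = TRUE)
```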
Bigram: Blogs
(Figure: bigram word-relationship graph for the blogs sample)
Here we can see some of the more common word relationships from the blogs dataset, which could be used as a basis for prediction. For example, if the last word typed was “mental”, then “health”, “abilities” and “illness” would be displayed as potential next words to choose from.
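In code, such a lookup could be as simple as the following sketch, using the `bigram_counts` table from above (the function name is illustrative):

```r
library(dplyr)

# Suggest the most frequent next words for the last word typed
suggest_next <- function(last_word, n_suggestions = 3) {
  bigram_counts %>%
    filter(word1 == last_word) %>%
    slice_head(n = n_suggestions) %>%
    pull(word2)
}

suggest_next("mental")   # e.g. "health", "abilities", "illness" for the blogs sample
```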
Bigram: News
(Figure: bigram word-relationship graph for the news sample)
These results match well with the type of content we have seen from the news set previously, and most of the word relationships make sense. Here we can see an example of how the prediction algorithm might respond over the course of typing/selecting two words: typing “golden” would lead to three suggestions, and if “gate” were typed or chosen then “bridge” would be suggested next.
Plans for the Prediction Model and Shiny App
From the analysis done so far, I feel as though I have most aspects in place to create a good quality prediction model. However, some questions still remain, such as:
- issues of coverage e.g. how to use a smaller number of words in the dictionary to cover the same number of phrases
- what to do when no predictions are found for a given word or words
- how to keep the model small so that it can run quickly and not take up too much memory
For the Shiny app, I currently imagine an interface with a text field and buttons that appear for the three or four most likely words. When the user begins typing in the text field, real-time prediction would be performed based on the current cursor position. Different types of prediction would be performed depending on a) whether there is text immediately before/after the cursor, or b) whether the character before the cursor is a space. In the former case, the text surrounding the cursor would be used to suggest possible completions of the current word (i.e. single-word analysis). In the latter case, the previous one or two words would be extracted and used to suggest the subsequent word (i.e. bigram or trigram analysis). When a suggested word is clicked it would replace the text surrounding the cursor. As a test feature, a toggle could be included to switch the dataset used for prediction between blogs, news, tweets or all three.
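A very rough sketch of how such an interface might begin to look in Shiny; all input ids are placeholders and `predict_words()` is a stub standing in for the eventual prediction model:

```r
library(shiny)

# Stub prediction function -- to be replaced by the real model
predict_words <- function(text) c("the", "to", "and")

ui <- fluidPage(
  textInput("user_text", label = NULL, placeholder = "Start typing..."),
  uiOutput("suggestions"),   # rendered as a row of suggested-word buttons
  radioButtons("corpus", "Prediction corpus",
               choices = c("All", "Blogs", "News", "Tweets"), inline = TRUE)
)

server <- function(input, output, session) {
  output$suggestions <- renderUI({
    words <- predict_words(input$user_text)
    lapply(words, function(w) actionButton(inputId = paste0("btn_", w), label = w))
  })
}

shinyApp(ui, server)
```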
Conclusion
In the course of this report I have covered approaches to single-word and multi-word analysis of the datasets, as well as the filtering of profanity and non-English words. I have presented some initial statistics and plots regarding the three datasets, which I hope act as good preparation for the stages to come and bode well for producing a high-quality prediction model.