The corpus for this project consists of text in four languages (English, Russian, German, and Finnish) from three types of sources (blogs, news articles, and Twitter). For this exploratory data analysis, I will use a sample of the English language portion of the corpus.
The Twitter portion of the English language corpus contains 2,360,148 lines of text, and the blogs and news portions contain 899,288 and 1,010,242 lines, respectively. I wrote a function that selects a random 1% of each portion and saves the selected lines in new text files. For this exploratory analysis and my preliminary modeling, I will work with these subsets of the corpus.
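A minimal sketch of how such a sampling function might look; the helper name `sample_corpus` and the file names are illustrative, not the project's actual code:

```r
# Sketch of a 1% sampling function. The function and file names are
# placeholders, not taken from the actual project code.
set.seed(1234)  # make the random sample reproducible

sample_corpus <- function(infile, outfile, fraction = 0.01) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  keep  <- sample(seq_along(lines), size = floor(fraction * length(lines)))
  writeLines(lines[keep], outfile)
}

sample_corpus("en_US.twitter.txt", "sample.twitter.txt")
sample_corpus("en_US.blogs.txt",   "sample.blogs.txt")
sample_corpus("en_US.news.txt",    "sample.news.txt")
```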
In order to explore the corpus and use it to build a word-predicting model, we need to do some processing of the data. The first step is tokenizing: separating the lines of text into individual words. To do this, we use the unnest_tokens function from the tidytext R package, which also converts all letters to lowercase and strips out punctuation.
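A small illustration of this tokenization step, using a toy character vector in place of the sampled blog lines:

```r
library(dplyr)
library(tidytext)

# Toy input standing in for the sampled blog lines read with readLines().
blogs_sample <- c("This is the first blog line.", "And here, a second one!")

blog_lines <- tibble(line = seq_along(blogs_sample), text = blogs_sample)

# unnest_tokens() lowercases the text and strips punctuation by default,
# producing one word per row.
blog_words <- blog_lines %>%
  unnest_tokens(word, text)
```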
We will begin our exploration of the sample of the corpus by examining the word counts in each line of each type of text.
| source | min | max | mean | median |
|---|---|---|---|---|
| blogs | 1 | 520 | 41.70 | 29 |
| news | 1 | 250 | 33.94 | 31 |
| twitter | 1 | 35 | 12.81 | 12 |
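The per-line summaries above could be computed along these lines; the toy input here stands in for one of the sampled sources:

```r
library(dplyr)
library(tidytext)

# Toy lines standing in for one source; the real analysis uses the 1% samples.
lines_df <- tibble(line = 1:3,
                   text = c("one two three", "just one line here", "short"))

# Count the number of word tokens in each line.
line_counts <- lines_df %>%
  unnest_tokens(word, text) %>%
  count(line, name = "words")

summarise(line_counts,
          min = min(words), max = max(words),
          mean = mean(words), median = median(words))
```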
Next, we examine which words are most common in each type of text. In natural language processing, it is common to remove "stopwords" (words such as articles and prepositions that don't carry much meaning on their own). In this case, I have chosen not to remove stopwords, because for this prediction task we will want such words to appear among our predictions. We will want to remove profanity and other offensive words, but we will leave that task for a later point in the modeling process.
We graph the 15 most common words from each type of text:
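One way such a figure might be produced, assuming blog_words, news_words, and twitter_words are one-word-per-row data frames as in the tokenization sketch above (the names are illustrative):

```r
library(dplyr)
library(ggplot2)

# Assumes blog_words, news_words, and twitter_words exist as one-word-per-row
# data frames produced by unnest_tokens(); the names are placeholders.
top_words <- bind_rows(blogs   = blog_words,
                       news    = news_words,
                       twitter = twitter_words,
                       .id = "source") %>%
  count(source, word, sort = TRUE) %>%
  group_by(source) %>%
  slice_max(n, n = 15) %>%
  ungroup()

ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ source, scales = "free") +
  labs(x = NULL, y = "count")
```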
It is interesting to note the pattern in the pronouns that appear among the 15 most common words in each type of source. "I" and "you" are among the 15 most common words in both blog and Twitter text, but not in news text. Conversely, "he" (though not "she") is one of the 15 most common words in news text but not in blogs or Twitter. This is not surprising: news articles report on what other people said or did, whereas in blog and Twitter posts the author writes about their own thoughts or experiences, addressing their audience directly.
Since we will be predicting the next word given a phrase, we need to know which word combinations are most likely to occur in text. An n-gram is a sequence of n words appearing together in the text. For now, we will examine the 2-grams and 3-grams in our sample of the English corpus.
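Extracting these n-grams follows the same pattern as word tokenization; a sketch with toy input:

```r
library(dplyr)
library(tidytext)

# Toy text standing in for the corpus sample.
lines_df <- tibble(text = c("thanks for the follow",
                            "looking forward to the weekend"))

# 2-grams: pairs of consecutive words
bigrams <- lines_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

# 3-grams: triples of consecutive words
trigrams <- lines_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)
```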
I plan to use an n-gram back-off model with some form of smoothing. Roughly speaking, the model looks at the inputted text and uses the largest useful value of n to predict the next word from n-gram frequencies. For example, if the last three words of the input form a relatively common combination in the corpus, the model predicts the word that most commonly follows those three words; if they do not, it "backs off" and makes a prediction using only the last two words. Smoothing is a way of handling combinations of words that appear rarely or not at all in the corpus.
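A minimal sketch of the back-off idea (smoothing not shown), using tiny hand-made count tables in place of tables built from the corpus; the table and column names are placeholders, not the final model:

```r
library(dplyr)

# Illustrative count tables: the history (preceding words) is split from the
# final word. Real tables would be built from the corpus n-gram counts.
fourgrams <- tibble(history = c("one of the", "at the end"),
                    next_word = c("best", "of"),   n = c(40, 35))
trigrams  <- tibble(history = c("of the", "for the"),
                    next_word = c("day", "first"), n = c(120, 95))
bigrams   <- tibble(history = c("the", "a"),
                    next_word = c("first", "lot"), n = c(300, 250))

predict_next <- function(input) {
  words    <- strsplit(tolower(input), "\\s+")[[1]]
  tables   <- list(fourgrams, trigrams, bigrams)  # longest history first
  hist_len <- c(3, 2, 1)
  for (i in seq_along(tables)) {
    k <- hist_len[i]
    if (length(words) < k) next
    hist <- paste(tail(words, k), collapse = " ")
    hits <- filter(tables[[i]], history == hist)
    if (nrow(hits) > 0) {
      return(hits$next_word[which.max(hits$n)])  # most frequent continuation
    }
  }
  "the"  # final fallback: a very common unigram
}

predict_next("I spent one of the")  # matches the 4-gram table: "best"
```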
To improve the performance of my model, I also plan to prune: n-grams that are very rare in the corpus will not be kept.
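For example, pruning could be as simple as filtering the count tables; the threshold of 5 and the name trigram_counts are placeholders to be tuned later:

```r
library(dplyr)

# Illustrative pruning step: keep only n-grams seen at least 5 times.
# trigram_counts stands in for a real table of 3-gram counts, such as the
# output of the count() call in the n-gram sketch above.
pruned_trigrams <- trigram_counts %>%
  filter(n >= 5)
```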
Other things I will consider doing: