A lot of the material and code in this work is based on the tidy text mining book available online at https://www.tidytextmining.com/.
We are given three large text files collected from blogs, Twitter, and news sources. The idea is to learn enough useful information about English words and their combinations to eventually be able to predict the next words that the user of a word prediction app might want to type, following some initially typed words and/or phrases.
First, below are some basic stats regarding the three datasets. These include total word and line counts in the three raw text files.
## # A tibble: 3 x 3
##   source  line_count word_count
##   <chr>        <int>      <int>
## 1 blogs       899288   38154238
## 2 news       1010242   35010785
## 3 twitter    2360148   30218166
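As a point of reference, here is a minimal sketch of how these counts might be produced, assuming the raw files carry the standard Capstone file names and counting a word as any whitespace-delimited token:

```r
library(readr)
library(dplyr)
library(purrr)
library(stringr)

# Assumed file names for the local copies of the corpus
files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")

# One row of counts per file; a "word" here is any whitespace-delimited token
imap_dfr(files, function(path, source) {
  lines <- read_lines(path)
  tibble(source     = source,
         line_count = length(lines),
         word_count = sum(str_count(lines, "\\S+")))
})
```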
From the table and the charts above we can see that the total line count is dominated by the Twitter feed, with the total number of tweets being close to 2.4 million. The number of lines in the blogs and news files is near 1 million each, with the news file having slightly more (1.01 million) and the blogs file slightly less (0.90 million).
When it comes to the word count, the Twitter data actually has the fewest total words, standing at marginally more than 30 million, behind the news data with about 35 million words and the blogs data with about 38 million words. The Twitter word count shouldn’t be much of a surprise, as the maximum tweet size was historically set to 140 characters and was only expanded to 280 in 2017.
The raw data contains lots of “junk”: characters from foreign languages, symbols, URLs, typos, etc. A nice and simple initial filtering approach is to keep only word tokens that consist of English letters, are between 3 and 15 characters long, and are not included in the stop_words word list found in the tidytext package.
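A sketch of this filter in tidytext terms, assuming the raw lines have been read into a tibble `raw_text` with columns `source`, `line_id`, and `text` (the regex below is one interpretation of the letter rule, keeping only tokens made up entirely of lowercase ASCII letters):

```r
library(dplyr)
library(tidytext)
library(stringr)

# raw_text: assumed tibble with columns source, line_id and text
tidy_words <- raw_text %>%
  unnest_tokens(word, text) %>%            # one lowercase token per row
  filter(str_detect(word, "^[a-z]+$"),     # English letters only (assumption)
         between(nchar(word), 3, 15)) %>%  # keep 3- to 15-character tokens
  anti_join(stop_words, by = "word")       # drop tidytext stop words
```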
After the aforementioned filtering steps, the picture of word counts has changed quite a bit. The overall number of words has decreased: many of the dropped tokens technically weren’t words, and many more were stop words that don’t carry a lot of meaning. The news data source now carries the most useful information.
The next step is to look at word occurrence frequency across the three data sources. Two main metrics that can identify the most frequently occurring words are term frequency (TF), the number of occurrences of a word divided by the count of all words, and term frequency-inverse document frequency (TF-IDF).
\(tf(term)=\frac{n_{termOccurrences}}{n_{totalTermCount}}\)
Computation of the TF-IDF is slightly more involved than that of the TF: its second factor, the inverse document frequency (IDF), is the natural log of the ratio of the number of documents in a data source to the number of documents that contain that particular term.
\(idf(term)=\ln(\frac{n_{docs}}{n_{docsWithTerm}})\)
The TF-IDF computation is then straightforward: it is simply the product of the term’s TF and IDF.
\(tfidf(term)=tf(term)*idf(term)\)
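A sketch of how both metrics might be computed with the tidytext package, following the definitions above, where each line acts as a “document” and each data source as its own corpus (`line_id` is the column assumed earlier):

```r
library(dplyr)
library(purrr)
library(tidytext)

# Plain term frequency per data source (used for the frequency charts)
word_tf <- tidy_words %>%
  count(source, word) %>%
  group_by(source) %>%
  mutate(tf = n / sum(n)) %>%
  ungroup()

# TF-IDF: each line is a "document", each source its own corpus,
# matching the IDF definition above
word_tfidf <- tidy_words %>%
  count(source, line_id, word) %>%
  group_split(source) %>%
  map_dfr(~ bind_tf_idf(.x, word, line_id, n)) %>%
  arrange(desc(tf_idf))
```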
The chart below shows the most frequent words found across the three data sets as identified by the TF metric.
We can see that the two data sources whose content originates from various blogs and tweets have words such as love, time, and people among the most popular ones.
Now, if we construct similar charts for the words that rank in the top 12 by the TF-IDF metric, we are going to see a “bit” of a different picture.
The nature of TF-IDF is to identify terms that are unique to a given entry (a line, in this case) of a corpus (here, a data source). This means that there are some really hateful entries or comments present in the blogs and tweets, which will potentially need to be removed when creating the word prediction model. It wouldn’t be a great idea to suggest to people how to insult others…
Finally, and quite importantly, let’s look at the relationships between words and how likely it is that a particular word is followed by another particular word.
To do so, let’s first look at the most frequent bigrams identified by the TF metric.
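A sketch of the bigram tokenization, reusing the `raw_text` tibble assumed earlier and applying the same filters to both halves of each bigram:

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(stringr)

# Tokenize into bigrams and filter both words the same way as before
bigram_counts <- raw_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(str_detect(word1, "^[a-z]+$"), between(nchar(word1), 3, 15),
         str_detect(word2, "^[a-z]+$"), between(nchar(word2), 3, 15),
         !word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(source, word1, word2, sort = TRUE)
```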
From the chart above we can see that the most frequent bigrams involve words that are definitely not insults and make quite a bit of sense for each data source. The news data has political, policy, and sports themes to it. Blogs also touch upon some of the policies, but go into religion a bit as well. Twitter has quite a bit of congratulatory connotation to it, as well as a couple of pop-culture-related bigrams.
Similar to individual words, we can apply the TF-IDF metric to word bigrams.
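One simple way to do this, mirroring the approach of the tidy text mining book, is to re-unite the two words and treat each whole data source as a “document”:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Rank bigrams by TF-IDF, with each data source as a "document"
bigram_tfidf <- bigram_counts %>%
  unite(bigram, word1, word2, sep = " ") %>%
  bind_tf_idf(bigram, source, n) %>%
  arrange(desc(tf_idf))
```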
The good news is that there is pretty much no profanity involved in the top TF-IDF bigrams. Bigrams identified in the chart above have something unique about them and might be worth exploring in more detail.
As a little extra credit, let’s visualize the relationships between words, in this case bigrams, as a graph, to see which words tend to follow which and perhaps get an idea of how the word prediction model could be constructed.
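One way such a graph might be drawn with the igraph and ggraph packages, pooling the three sources and keeping only bigrams above an arbitrary frequency cutoff:

```r
library(dplyr)
library(igraph)
library(ggraph)

set.seed(2020)  # fixed layout for reproducibility

# Pool the three sources and keep only frequent bigrams
# (the cutoff of 100 is an arbitrary illustrative choice)
bigram_graph <- bigram_counts %>%
  count(word1, word2, wt = n, name = "n") %>%
  filter(n > 100) %>%
  graph_from_data_frame()

# Directed graph: an edge points from a word to the word that follows it
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point(colour = "lightblue", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)
```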
The graph above shows which words are more likely to have connections. The simplest word prediction model might take a word as input and suggest the most likely next word based on a graph model similar to the one shown above. If the word isn’t found, its part of speech can be determined and followed up with some generic word that is appropriate based on English grammar rules.
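A minimal sketch of that simplest model, as a lookup over the pooled bigram counts (the `predict_next` helper and its name are illustrative, not part of this report’s code; the part-of-speech fallback is left out):

```r
library(dplyr)

# Illustrative helper: given a word, return the most frequent
# word observed right after it in the bigram counts
predict_next <- function(current_word, counts = bigram_counts) {
  counts %>%
    filter(word1 == current_word) %>%
    slice_max(n, n = 1, with_ties = FALSE) %>%
    pull(word2)  # returns an empty vector if the word was never seen
}

predict_next("happy")  # might plausibly return "birthday" given the Twitter data
```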
This milestone report performs some of the necessary exploratory data analysis of the three textual data sources provided for the Capstone Project. Additional exploratory analyses that would be worth performing on the data in the future include sentiment analysis and topic modeling.