This document captures some interesting elements of an initial exploratory analysis of the data for the capstone project.
Word-count and character-count histograms are examined for three corpora: tweets, blogs, and news. Marked differences in character and word distributions are found between the corpora.
It uses the {tm} package in R to look at word frequencies and associations.
I also look at the application of Zipf’s Law to the texts with some interesting findings about the relative importance of document type and word length in deviations from Zipf’s law.
The data collected for this project are text corpora compiled by Hans Christensen. Facts and some statistics about the files are shown here.
The texts studied are English tweets, blogs, and news articles. I also look briefly at a German corpus to check how well Zipf’s Law applies.
Since the data sets are huge, for this analysis I use just a small, randomly selected sample of each (comprising between 0.5% and 1.3% of the totals). This sample size proves more than adequate for the purpose of this analysis, and the reduction improves execution speed dramatically.
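Sampling can be done with a simple random draw of lines; a minimal sketch, where the file name and sample size are assumed from the figures quoted below:

set.seed(1234)   # make the random sample reproducible
full_text <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
text <- sample(full_text, 24000)   # roughly 1% of the ~2.36M lines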
The text as imported has 24,000 lines, versus the reported 2,360,000 lines of the original.
After removing lines containing offensive words, 23,368 lines remain, a 2.6% reduction.
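One way to do this filtering is to drop any line matching a word from a profanity list; a sketch, where profanity_list.txt is a hypothetical stand-in for whatever offensive-word list is used:

bad_words <- readLines("profanity_list.txt")   # hypothetical offensive-word list
bad_pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
text <- text[!grepl(bad_pattern, text, ignore.case = TRUE)]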
The data set contains 300,030 words. For these lines of text:
- the maximum number of characters is 140
- the minimum number of characters is 5
- the median number of characters is 64
- the mean number of characters is 68.66
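These statistics come directly from per-line character counts; a minimal sketch, assuming text holds the filtered sample:

char_counts <- nchar(text)   # characters per line
c(max = max(char_counts), min = min(char_counts),
  median = median(char_counts), mean = mean(char_counts))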
A histogram of the character counts shows a complex distribution with an (expected) sharp cut-off at 140 characters.
The word counts show a broad distribution with a peak near 8 words.
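Both histograms can be drawn with base graphics, continuing from char_counts above; word counts here come from a simple whitespace split:

word_counts <- lengths(strsplit(text, "\\s+"))   # words per line
hist(char_counts, breaks = 50, main = "Characters per tweet", xlab = "characters")
hist(word_counts, breaks = 50, main = "Words per tweet", xlab = "words")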
Zipf’s Law states that a word’s frequency is inversely related to its rank. We therefore expect that word frequency plotted against rank, on log-log scales, will fall on a line of slope -1.
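In symbols (a standard restatement of the law, not taken from the original analysis): if $f(r)$ is the frequency of the word at rank $r$, then

$$f(r) \propto \frac{1}{r} \quad\Longrightarrow\quad \log f(r) = C - \log r,$$

which is the straight line of slope -1 on log-log axes.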
The tm package has a ready-built function, Zipf_plot(), to plot this.
library(tm)   # provides Zipf_plot(), if not already loaded
Zipf_plot(tdm, "l")
## (Intercept) x
## 10.57519 -1.08095
We can see that the calculated slope is indeed very near -1.0. However, the plot contains just lines and is a little unsatisfactory, so let’s dig a little deeper.
We can use the term-document matrix as a starting point and directly calculate word frequencies and ranks.
tdm_matrix <- as.matrix(tdm)                     # term-document matrix as a plain matrix
word_Sums <- rowSums(tdm_matrix)                 # total frequency of each word
word_Sums <- sort(word_Sums, decreasing = TRUE)  # sort into rank order
According to Zipf’s Law, a plot of log-frequency against log-rank should follow a straight line with slope -1. We can see from the graph below that this is roughly the case, though a slight “bow” in the curve is clearly visible.
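The plot and the fitted line can be produced directly from word_Sums; a minimal sketch:

freq <- as.numeric(word_Sums)   # frequencies, already in rank order
rank <- seq_along(freq)         # rank 1 = most frequent word
plot(log10(rank), log10(freq), pch = ".",
     xlab = "log10(rank)", ylab = "log10(frequency)")
zipf_fit <- lm(log10(freq) ~ log10(rank))   # slope should be near -1
abline(zipf_fit, col = "red")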
We can examine the “bow” a little more closely by subtracting the fitted straight line from the data.
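Continuing the sketch above, the deviation is just the residuals of the fitted line:

zipf_dev <- resid(zipf_fit)   # positive values lie above the fitted line
plot(log10(rank), zipf_dev, type = "l",
     xlab = "log10(rank)", ylab = "deviation from fitted line")
abline(h = 0, lty = 2)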
As a hypothesis, the highly restricted character length of tweets may push writers toward shorter words, altering this behavior. To test this, word length is used as a factor to separate the plots.
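One way to build such a plot is to group words by character length and draw a separate frequency-rank curve for each group; a sketch, again starting from word_Sums:

word_len <- nchar(names(word_Sums))   # character length of each word
plot(NULL, xlim = c(0, log10(length(word_Sums))),
     ylim = c(0, log10(max(word_Sums))),
     xlab = "log10(rank)", ylab = "log10(frequency)")
for (len in 3:8) {                    # one frequency-rank curve per word length
  f <- sort(word_Sums[word_len == len], decreasing = TRUE)
  lines(log10(seq_along(f)), log10(f), col = len - 2)
}
legend("topright", legend = paste(3:8, "letters"), col = 1:6, lty = 1)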
The curves show that, while longer words are used less frequently than shorter words, as expected, they all follow the same curve to a large degree. So, at least to the precision of this visual analysis, deviations from Zipf’s Law in tweets do not arise from the choice of shorter words alone, since shorter and longer words appear to follow the same curve.
Here we look at the relative usage of the words “love” and “hate”.
From earlier exploration we know that love and hate occur in about a 3.86:1 ratio. Another question we might ask is how richly each is used in association with other words. Here are the top associated words in the tweets.
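Association scores like those below can be computed with tm’s findAssocs(); a minimal sketch (the 0.08 correlation cut-off is an assumption chosen to match the smallest scores shown):

findAssocs(tdm, "love", corlimit = 0.08)   # terms most correlated with "love"
findAssocs(tdm, "hate", corlimit = 0.08)   # terms most correlated with "hate"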
The top words associated with love are:
## love word
## 1 0.09 clumsy
## 2 0.09 dsnt
## 3 0.09 mre
## 4 0.09 thumbsy
## 5 0.09 wpres
The top words associated with hate are:
## hate word
## 1 0.22 nixon
## 2 0.10 discriminate
## 3 0.08 blacks
## 4 0.08 destroy
## 5 0.08 richard
We can do a similar analysis for the News Text.
The original text has 24,000 lines and, with offensive texts removed, has 23,905 lines, a 0.4% reduction.
The data set contains 23,905 lines of text and 806,646 words. For these lines of text:
- the maximum number of characters is 8,949
- the minimum number of characters is 2
- the median number of characters is 183
- the mean number of characters is 199.47
The distribution of word counts is reasonably well behaved.
Note the log scale of the x-axis.
The ratio of the words “love” and “hate” in the text is found with a couple of grepls.
love_test <- grepl("love", text)   # TRUE for lines containing "love"
hate_test <- grepl("hate", text)   # TRUE for lines containing "hate"
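# the ratio is the count of lines matching "love" over lines matching "hate"
sum(love_test) / sum(hate_test)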
The love:hate ratio is the number of lines mentioning love versus those mentioning hate. In this case the ratio is 3.42:1.
The top words associated with love are:
## love word
## 1 0.23 castellino
## 2 0.23 doowop
## 3 0.23 harmonies
## 4 0.23 holdridge
## 5 0.23 runaround
The top words associated with hate are:
## hate word
## 1 0.26 allchina
## 2 0.26 befell
## 3 0.26 bill
## 4 0.26 cabrera’s
## 5 0.26 crucified
It’s interesting that these word associations are different from those for tweets. Whereas “hate” seems to have a much richer association in tweets, “love” appears to have the richer association in this text, pointing to stylistic differences.
Blog text is analyzed in the same way…
The original text has 23,905 lines and, with offensive texts removed, still has 23,905 lines, a 0% reduction.
The data set contains 1,236,918 words. For these lines of text:
- the maximum number of characters is 4,115
- the minimum number of characters is 2
- the median number of characters is 153.5
- the mean number of characters is 228.32
Here is a histogram of the data showing the distribution of word counts in the blogs. The large number of very short blogs (fewer than 100 words) appears similar to the tweet distribution, suggesting the two corpora (blogs and tweets) may be somewhat confounded.
Again, we look at the frequency and usage of the words “love” and “hate”.
The love:hate ratio is the number of lines mentioning love versus those mentioning hate. In this case the ratio is 4.35:1.
The top words associated with love are:
## love word
## 1 0.22 unshakable
## 2 0.21 unconditional
## 3 0.20 nora
## 4 0.18 purest
## 5 0.15 and
The top words associated with hate are:
## hate word
## 1 0.43 telltale
## 2 0.37 phony
## 3 0.33 tread
## 4 0.26 dunno
## 5 0.26 whore
Some of the frequent 3-word terms in the blogs are:
## [1] "a bad thing" "a big deal" "a big part" "a bit and"
## [5] "a bit more" "a bit of" "a bit too" "a bottle of"
## [9] "a brand new" "a break from"
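For reference, here is a minimal base-R way to count trigrams like these (assuming text holds the cleaned blog lines; a dedicated tokenizer such as RWeka’s NGramTokenizer could be used instead):

words <- unlist(strsplit(tolower(text), "[^a-z']+"))   # crude word tokenizer
words <- words[words != ""]
trigrams <- paste(head(words, -2), head(words[-1], -1), words[-(1:2)])
head(sort(table(trigrams), decreasing = TRUE), 10)     # ten most frequent trigrams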
For the heck of it, I thought I would look at some summary stats for German-language text, specifically to see how well Zipf’s Law holds.
The data set contains 327,447 words.
A histogram showing the distribution of word counts in the German news text is similar to that of the English text.
The German text also appears to follow Zipf’s Law.
The Zipf’s Law deviation analysis, as above, shows good conformance at the beginning and end of the distribution, but some deviation in the middle.
Tweets, Blogs, and News Text show distinct differences in the distributions of words, word structures, and word associations.
Zipf’s Law approximately holds for all corpora, though a more detailed analysis shows that the deviations are text-specific.
My plan for the Shiny app is to tokenize random samples of the News Corpus. It appears to require the least filtering and contains the fewest “non-words,” so it offers the quickest path to a solution.
Once these are tokenized, I will use a Markov prediction model based on 2-grams and 3-grams as predictors.
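As a rough sketch of that prediction step (all table and function names here are hypothetical), a simple back-off lookup over pre-computed n-gram count tables might look like this:

# trigram_counts / bigram_counts: named numeric vectors of n-gram frequencies,
# e.g. trigram_counts["one of the"] (hypothetical pre-computed tables)
predict_next <- function(w1, w2, trigram_counts, bigram_counts) {
  # prefer trigrams whose first two words match the preceding pair
  hits <- trigram_counts[startsWith(names(trigram_counts), paste(w1, w2, ""))]
  if (length(hits) == 0)   # back off to bigrams starting with the last word
    hits <- bigram_counts[startsWith(names(bigram_counts), paste0(w2, " "))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(which.max(hits))
  tail(strsplit(best, " ")[[1]], 1)   # final word of the best-matching n-gram
}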