Exploratory Analysis of Three Text Data Sets

Background

This report presents an initial, exploratory analysis of three text data sets: one from blogs, one from news reports and one from Twitter posts. Each data set compiles text from many authors rather than from a single individual. An individual's writing tends to show patterns shaped by where they have lived, their language, their education, etc. Text pooled from many different individuals would not show these individual patterns. Text is often considered the most unstructured kind of data, as words can be combined in an almost infinite number of ways and even a single word can have multiple meanings. For example, “present” can mean a gift, a moment in time (now), a grammatical tense (present tense) or being in a particular place (present at the table).

For this analysis, the data sets were not combined. Each data set was quite large; therefore, a smaller training set of 2000 lines per data set was randomly chosen. The data was also “cleaned up”, which involved eliminating punctuation, capital letters, numbers, foreign characters, blank lines and stop words (very common words that add little value, such as “the”, “and” and “a”). Some basic metrics for the three data sets, including the number of lines and words, are given below.
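The sampling and cleaning steps described above can be sketched as follows. This is a minimal Python illustration, not the code used for this report. The function names, the seed and the very short stop-word list are assumptions made for the example, and “eliminating capital letters” is assumed to mean converting text to lowercase.

```python
import random
import re

# Assumption: a tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}

def sample_lines(lines, n=2000, seed=42):
    """Randomly choose up to n lines without replacement."""
    rng = random.Random(seed)
    if len(lines) <= n:
        return list(lines)
    return rng.sample(lines, n)

def clean_line(line):
    """Lowercase, keep only a-z and whitespace, then drop stop words."""
    line = line.lower()
    # Replaces punctuation, digits and non-ASCII characters with spaces.
    line = re.sub(r"[^a-z\s]", " ", line)
    return [w for w in line.split() if w not in STOP_WORDS]

def clean_corpus(lines):
    """Clean every line and discard lines that end up empty."""
    cleaned = (clean_line(line) for line in lines)
    return [words for words in cleaned if words]
```

Note that stripping punctuation this way splits “don't” into “don” and “t”, which may be why “don” appears among the frequent blog words.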

Count Unique Words and Rank Them

This section asks how many unique words each data set contains. Unique words make for interesting reading but are harder to predict. One long-term goal of this project is to predict the next word given the two or three words before it. Knowing how many unique words there are would be useful for developing a prediction algorithm. The three tables below list the ten most frequent words in the blogs, news reports and Twitter samples, in that order.

Blogs
Word	Count
time	177
people	130
day	105
life	97
love	94
world	68
don	67
family	64
home	62
god	54

News
Word	Count
people	81
time	73
city	70
school	58
home	57
million	54
center	51
percent	51
season	51
county	48

Twitter
Word	Count
love	94
rt	76
day	72
time	54
lol	51
follow	40
awesome	39
night	33
people	33
tonight	33
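
Counting and ranking words, as in the tables above, could look like the following sketch. It assumes each line has already been cleaned into a list of words; the function name and the toy input are hypothetical and are not taken from the three data sets.

```python
from collections import Counter

def rank_words(cleaned_lines, top_n=10):
    """Return (number of unique words, top_n most frequent (word, count) pairs)."""
    counts = Counter(word for line in cleaned_lines for word in line)
    return len(counts), counts.most_common(top_n)

# Hypothetical toy input, not taken from the data sets above.
toy = [["time", "people", "time"], ["love", "time", "people"]]
unique, top = rank_words(toy, top_n=2)
print(unique, top)
```

The first value is the size of the vocabulary and the second is the ranked list, so both questions in this section can be answered from one pass over the data.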

Does this data set have a long or short tail?

A long tail would mean that a sizable portion of the data set is made up of unique or rarely used words. How would one predict a word that shows up only once in 2000 lines? As part of the cleaning process, stop words were removed. With very long-tailed data, it may be worth revisiting that choice, as it is possible that including stop words would improve the algorithm.

Long Tail - number of unique words that make up 90% of the data set

Data set	Unique words
Blogs	7960
News	7929
Twitter	3649

Short Tail - number of unique words that make up 50% of the data set

Data set	Unique words
Blogs	1350
News	1430
Twitter	685
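
One possible reading of the 90% and 50% figures is the number of most frequent unique words needed to cover that share of all word occurrences. The report does not state exactly how the figures were computed, so the sketch below is an assumption about the method, with a hypothetical function name and toy counts.

```python
from collections import Counter

def words_for_coverage(counts, fraction):
    """Smallest number of most-frequent unique words whose counts sum to
    at least `fraction` of all word occurrences."""
    total = sum(counts.values())
    target = fraction * total
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= target:
            return rank
    return 0  # only reached for empty input

# Hypothetical toy counts: 10 word occurrences spread over 4 unique words.
toy_counts = Counter({"a": 5, "b": 3, "c": 1, "d": 1})
half = words_for_coverage(toy_counts, 0.5)
most = words_for_coverage(toy_counts, 0.9)
```

A large gap between the two numbers, as in all three data sets here, indicates that many rarely used words are needed to cover the last part of the text.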

Exploratory Sentiment Analysis

One interesting way to look at text is to ask what it is trying to convey. Is it positive or negative? The Bing lexicon answers that by labeling words as positive or negative. How positive or negative? The AFINN lexicon scores words from 5 (most positive) to -5 (most negative). Does the text carry big themes such as joy or fear? The NRC lexicon assigns words to emotion categories of this kind and also labels them as positive or negative. To get a sense of these three data sets, positive and negative words were examined. For these three histograms, the NRC lexicon was used.
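
The lexicon lookups described above can be illustrated with a small sketch. The two dictionaries below are hypothetical mini-lexicons made up for the example; their entries and scores are not copied from Bing, AFINN or NRC, and the function names are likewise assumptions.

```python
from collections import Counter

# Hypothetical mini-lexicons for illustration only; these entries are NOT
# copied from Bing, AFINN or NRC.
POLARITY = {"love": "positive", "awesome": "positive",
            "hate": "negative", "sad": "negative"}
SCORES = {"love": 3, "awesome": 4, "hate": -3, "sad": -2}

def polarity_counts(words):
    """Count how many words the polarity lexicon labels positive or negative."""
    return Counter(POLARITY[w] for w in words if w in POLARITY)

def net_score(words):
    """Sum numeric scores; words absent from the lexicon contribute 0."""
    return sum(SCORES.get(w, 0) for w in words)

# Hypothetical cleaned line; "day" is in neither lexicon and is ignored.
sample = ["love", "day", "awesome", "sad"]
counts = polarity_counts(sample)
score = net_score(sample)
```

Words missing from a lexicon are simply skipped, so the result depends heavily on how much of the vocabulary the chosen lexicon covers.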

Next Steps

The goal of this project is to predict the next word, knowing the two previous words. Instead of looking at individual words, the next step is to look at small units of two or three words, building n-grams. As part of that process, several new tools will be used, including the widyr and ggraph packages and document-term matrices.
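
As a rough illustration of that next step, the sketch below builds three-word n-grams (trigrams) and predicts the next word from the two before it by picking the most frequent follower. It is a simple frequency lookup under assumed function names and toy input, not the prediction algorithm planned for this project.

```python
from collections import Counter, defaultdict

def build_trigram_model(cleaned_lines):
    """Map each pair of consecutive words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for words in cleaned_lines:
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            model[(w1, w2)][w3] += 1
    return model

def predict_next(model, w1, w2):
    """Return the most frequent follower of (w1, w2), or None if the pair is unseen."""
    followers = model.get((w1, w2))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical toy lines, not taken from the data sets.
toy_lines = [["happy", "new", "year"],
             ["happy", "new", "year"],
             ["happy", "new", "home"]]
toy_model = build_trigram_model(toy_lines)
guess = predict_next(toy_model, "happy", "new")
```

Pairs that never occur in the training lines return None, which is where the long tail discussed earlier becomes a practical problem for prediction.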