Background

This report presents an initial, exploratory analysis of three text data sets: one from blogs, one from news reports and one from Twitter posts. Each data set is a compilation of text from many authors rather than from a single individual. Individuals tend to have writing patterns of their own, shaped by where they have lived, their language, their education, etc. Text compiled from many different individuals would not show these individual patterns. Text is often considered the most unstructured kind of data, as words can be combined in an almost infinite number of ways and even single words can have multiple meanings. For example, “present” can mean a gift, a moment in time (now), a grammatical tense (present tense) or being in a particular place (present at the table).

For this analysis, the data sets were not combined. Each data set was quite large; therefore, a smaller training set of 2000 lines per data set was randomly chosen. The data was also “cleaned up”, which involved converting capital letters to lower case and removing punctuation, numbers, foreign characters, blank lines and stop words (very common words that add little value, such as “the”, “and” and “a”). Some basic metrics for the three data sets, including the number of lines and the number of words, are given below.
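The sampling and cleaning steps described above can be sketched as follows. This is a minimal illustration in Python, not the code used for this report: the function names, the random seed and the short stop-word list are all assumptions made for the example.

```python
import random
import re

# Small illustrative stop-word list (an assumption; the list used in the report is not shown).
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is"}


def sample_lines(lines, n=2000, seed=42):
    """Randomly choose up to n lines without replacement."""
    rng = random.Random(seed)
    if len(lines) <= n:
        return list(lines)
    return rng.sample(lines, n)


def clean_line(line):
    """Lowercase a line, keep only the letters a-z, and drop stop words."""
    line = line.lower()
    # Replace punctuation, digits and non-ASCII characters with a space.
    line = re.sub(r"[^a-z\s]", " ", line)
    return [word for word in line.split() if word not in STOP_WORDS]


def clean_corpus(lines):
    """Clean every line and discard lines that end up empty."""
    cleaned = (clean_line(line) for line in lines)
    return [words for words in cleaned if words]
```

In this sketch, replacing punctuation with a space splits a contraction such as “don't” into “don” and “t”, so fragments of contractions can show up as words in later counts.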

Count Unique Words and Rank Them

This section asks how many unique words are in each data set and which words occur most often. Unique words make for interesting reading, but they are harder to predict. One long-term goal of this project is to predict the next word, knowing the previous two or three. Knowing how many unique words there are would be useful for developing a prediction algorithm. The three tables below list the ten most frequent words, with their counts (n), for blogs, news reports and Twitter posts, in that order.

Blogs - Count

text      n
time      177
people    130
day       105
life      97
love      94
world     68
don       67
family    64
home      62
god       54

News - Count

text      n
people    81
time      73
city      70
school    58
home      57
million   54
center    51
percent   51
season    51
county    48

Twitter - Count

text      n
love      94
rt        76
day       72
time      54
lol       51
follow    40
awesome   39
night     33
people    33
tonight   33
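
Ranked counts like those in the tables above can be produced with a simple frequency count. The sketch below uses Python's standard library; the function names and the tiny sample are hypothetical and are not taken from the data sets in this report.

```python
from collections import Counter


def top_words(token_lists, k=10):
    """Return the k most frequent words as (word, count) pairs."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return counts.most_common(k)


def unique_word_count(token_lists):
    """Return the number of distinct words across all lines."""
    return len({word for tokens in token_lists for word in tokens})


# Hypothetical cleaned lines, each already split into words.
sample = [["time", "people", "day"], ["time", "love"], ["time", "people"]]
print(top_words(sample, k=2))     # [('time', 3), ('people', 2)]
print(unique_word_count(sample))  # 4
```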

Does this data set have a long or short tail?

A long tail would mean that a sizable portion of the words in the data set are rare, appearing only once or a few times. How would you predict a word which only shows up once in 2000 lines? As part of the cleaning process, stop words (very common words like “the”, “and”, “an”) were removed. With very long-tailed data, one may reconsider removing stop words, as it is possible that including them would improve the algorithm.

Long Tail - number of words that make up 90% of the data set

Data set   Words
Blogs      7960
News       7929
Twitter    3649

Short Tail - number of words that make up 50% of the data set

Data set   Words
Blogs      1350
News       1430
Twitter    685
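
One way to arrive at figures of this kind is to sort words by frequency and count how many of the most frequent ones are needed to cover a given share of all word occurrences. The sketch below assumes that interpretation; the report does not show how its numbers were computed, and the function name is hypothetical.

```python
from collections import Counter


def words_needed_for_coverage(tokens, coverage):
    """Smallest number of distinct words, taken most frequent first,
    whose counts add up to at least `coverage` (0 to 1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= coverage * total:
            return rank
    return len(counts)
```

With a long tail, the number returned for 90% coverage is much larger than the number returned for 50%, because many rare words are needed to cover the last part of the text.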

Exploratory Sentiment Analysis

One interesting way to look at text is to try to understand what it is trying to convey. Is it positive or negative? The Bing lexicon classifies words as positive or negative. How positive or negative? The AFINN lexicon scores words from 5 (most positive) to -5 (most negative). Does it have big themes like joy, fear, etc.? The NRC lexicon assigns words to such emotions and also categorizes them as positive or negative. To get a sense of these three data sets, positive and negative words were examined. For the three histograms, the NRC lexicon was used.
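
Lexicon-based sentiment analysis amounts to looking up each word in a word list and tallying the results. The sketch below shows the idea with two tiny hand-written lexicons. They are hypothetical stand-ins whose entries and values may not match the real Bing, AFINN or NRC lexicons.

```python
from collections import Counter

# Hypothetical toy lexicons, for illustration only.
TOY_POLARITY = {
    "love": "positive",
    "awesome": "positive",
    "hate": "negative",
    "awful": "negative",
}
TOY_SCORES = {"love": 3, "awesome": 4, "hate": -3, "awful": -3}


def polarity_counts(tokens, lexicon=TOY_POLARITY):
    """Count positive and negative words (categorical, Bing-style lookup)."""
    return Counter(lexicon[word] for word in tokens if word in lexicon)


def total_score(tokens, scores=TOY_SCORES):
    """Sum numeric word scores (AFINN-style lookup); unknown words score 0."""
    return sum(scores.get(word, 0) for word in tokens)
```

Words missing from a lexicon are simply ignored, so the result depends heavily on how well the lexicon covers the vocabulary of the text.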

Next Steps

The goal of this project is to predict the next word, knowing the two previous words. Instead of looking at individual words, the next step is to look at short sequences of two or three words, that is, to build n-grams. As part of that process, several new tools will be used, including the widyr and ggraph packages and document-term matrices.
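
As a rough illustration of where this is heading, the sketch below counts three-word sequences (trigrams) and predicts the next word as the most frequent follower of the two previous words. It is a minimal Python example under simple assumptions, not the planned implementation; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_trigram_model(token_lists):
    """Map each pair of consecutive words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for tokens in token_lists:
        # Slide a window of three words across each line.
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            model[(w1, w2)][w3] += 1
    return model


def predict_next(model, w1, w2):
    """Return the most frequent follower of (w1, w2), or None if the pair is unseen."""
    followers = model.get((w1, w2))
    if not followers:
        return None
    return followers.most_common(1)[0][0]
```

A model like this returns nothing for word pairs it has never seen, which is where the long tail discussed earlier becomes a practical problem.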