Corpus Exploration

This document describes the data exploration task for the Capstone project. The data is a collection of news, blog and Twitter text in four different languages: English, German, Finnish and Russian. For the sake of this project, we will focus only on the English data set. This document explores the data sets to arrive at an understanding of this corpus.

Exploration strategy

Most examples of corpus exploration treat all the documents in a directory as one ‘set’. I have instead treated each document (news, blogs and Twitter) as an individual corpus item, chiefly because each data source is unique in terms of authorship and style. For example, the Twitter feed is extremely informal because it is personal, and the character limit on a tweet further restricts how the medium is used. At the other end of the spectrum is the news feed, where it is fair to assume that the text is formal, grammatically accurate and produced by staff trained in this field. A blog entry probably sits midway in terms of authorship and formality.

It is quite certain that this strategy will have ramifications in the later stages of this project.

To begin with, for each corpus (Twitter, news and blogs), we will start with basic statistics. Then, we will look for features that hint at the human elements the text refers to. For example, what does the presence of various smiley icons in the Twitter corpus imply: an overly optimistic mood, or a gloomy one? Does the news corpus call out one country more often than others? How about celebrities? Do the blogs refer to certain places of international significance more than others?
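As a minimal sketch of what such basic statistics could look like, the snippet below counts lines, whitespace-separated words and characters using base R only. The sample lines and the helper name basicStats are made up for illustration; they are not part of the project scripts.

```r
# Hypothetical three-line sample standing in for one corpus file.
sampleLines <- c("Happy Monday everyone :)",
                 "The council met in March to discuss the budget.",
                 "New blog post is up!")

# basicStats is an illustrative helper, not part of the original scripts.
basicStats <- function(lines) {
  # Split each line on runs of whitespace to approximate word tokens.
  words <- strsplit(trimws(lines), "\\s+")
  c(lines        = length(lines),
    words        = sum(lengths(words)),
    characters   = sum(nchar(lines)),
    maxLineChars = max(nchar(lines)))
}

corpusStats <- basicStats(sampleLines)
print(corpusStats)
```

For a real file, the character vector could presumably be obtained with readLines() on the corresponding en_US text file instead of the sample above.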

Exploration code

To extend this analysis, you only have to arrive at a new list of words that can be plugged in as search strings to get the counts. The strategy behind the code for this exploration is based on three simple steps:

Take a data set related to a human element.
Look up the entries from that data set in the document.
Collect the counts for each entry.

Finally, the counts will be plotted to understand the trends.
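The three steps above can be sketched as one small helper. This is only an illustration under my own assumptions: countTerms and the sample lines are hypothetical, and the helper matches each term literally (fixed = TRUE), whereas the exploration code further down passes the terms to grep() as regular expressions.

```r
# countTerms is a hypothetical helper, not part of the original scripts.
countTerms <- function(terms, lines, ignoreCase = FALSE) {
  # Keep the original spelling (or names) of the terms as plot labels.
  labels <- if (is.null(names(terms))) terms else names(terms)
  if (ignoreCase) {
    terms <- tolower(terms)
    lines <- tolower(lines)
  }
  # fixed = TRUE matches each term literally; sum() counts the matching lines.
  counts <- vapply(terms,
                   function(term) sum(grepl(term, lines, fixed = TRUE)),
                   integer(1), USE.NAMES = FALSE)
  names(counts) <- labels
  counts
}

# Made-up sample lines, for illustration only.
sampleLines <- c("Good morning :)",
                 "Off to New York in May",
                 "new york is cold :(",
                 "Smile :) :)")

smileyCounts <- countTerms(c(":)", ":("), sampleLines)
placeCounts  <- countTerms("New York", sampleLines, ignoreCase = TRUE)
print(smileyCounts)
print(placeCounts)
```

Like length(grep(...)), this counts the lines that contain a term, so a line with two smiles is counted once.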

The word lists used were taken from the following sources:

Smileys - http://www.webopedia.com/quick_ref/textmessageabbreviations_02.asp
Famous names - http://www.kalmasoft.com/KLEX/dbfamnm.htm
Famous places - http://www.kalmasoft.com/KLEX/dbtopo.htm

Twitter data set

First, we load the English-language Twitter data set. Then, we count the tweets that contain smileys, names of months, places and celebrities.

library(NLP)
library(tm)
# Set the working directory and load the reference word lists
setwd("~/R/CAPS")
source('scripts/monthNames.R')
source('scripts/smileys.R')
source('scripts/famousNames.R')
source('scripts/famousPlaces.R')
# Read the English Twitter file and extract its lines as a character vector
twitter <- Corpus(URISource('file://final/en_US/en_US.twitter.txt'))
tweets <- twitter[[1]][1][[1]]
# For each smiley, find the tweets that contain it, then count them
smileyTweets <- sapply(smileys, function(i)
  grep(i, tweets))
smileyTweetsCounts <- sapply(smileyTweets, length)
# Same for month names (case-sensitive)
monthTweets <- sapply(monthNames, function(i)
  grep(i, tweets))
monthNamesCounts <- sapply(monthTweets, length)
# Same for famous names (case-insensitive)
famousNamesTweets <- sapply(famousNames, function(i)
  grep(i, tweets, ignore.case = TRUE))
famousNamesTweetsCounts <- sapply(famousNamesTweets, length)
# Same for famous places (case-insensitive)
famousPlacesTweets <- sapply(famousPlaces, function(i)
  grep(i, tweets, ignore.case = TRUE))
famousPlacesTweetsCounts <- sapply(famousPlacesTweets, length)

Smileys

Let us plot the counts to see the ‘mood’ of the ‘Twitterati’.

barplot(smileyTweetsCounts,
        main='Mood - Twitter',
        xlab='Emotion',
        ylab='Occurrence',
        las=2)

Twitter mood

This is a very happy community, with winks and confusion as distant runners-up. Even allowing for the fact that not all symbols were searched for, and that symbols could also have been typed in right-to-left form, the smile symbol appears far more often than the others.
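The right-to-left forms could be folded into the same count. The sketch below is an assumption on my part: the list smileyForms, the helper countAnyForm and the sample tweets are all hypothetical, and the mirrored forms shown are only two of many possible variants.

```r
# Hypothetical smileys, each with its left-to-right and mirrored form.
smileyForms <- list(smile = c(":)", "(:"),
                    wink  = c(";)", "(;"))

# Made-up sample tweets, for illustration only.
sampleTweets <- c("great day :)",
                  "(: morning all",
                  "see you ;)",
                  "no emoticon here")

# Count the lines that contain at least one of the given forms.
countAnyForm <- function(forms, lines) {
  hits <- Reduce(`|`, lapply(forms, function(f) grepl(f, lines, fixed = TRUE)))
  sum(hits)
}

moodCounts <- vapply(smileyForms, countAnyForm, integer(1), lines = sampleTweets)
print(moodCounts)
```

Mirrored forms are short strings, so they may also match inside ordinary punctuation; the resulting counts would need a sanity check.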

Month names

Was there a reference to any particular month of the year by the community?

barplot(monthNamesCounts,
        main='Months - Tweets',
        xlab='Month',
        ylab='Occurrence',
        las=2)

Months

For some reason, the month of March appears with the highest frequency. Could it be because the search string also matches other words, such as the verb mar? The same concern applies to May, which is also a common verb.
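One way to test this suspicion is to compare a plain substring match with a whole-word, case-sensitive match. The sample lines below are made up, and the patterns are my own assumptions, since the contents of monthNames.R are not shown here.

```r
# Made-up sample lines, for illustration only.
sampleLines <- c("The market opens in March",
                 "Do not mar the finish",
                 "It may rain",
                 "See you in May")

# Substring match, ignoring case: also hits "market" and the verb "mar".
looseMar <- sum(grepl("mar", sampleLines, ignore.case = TRUE))

# Whole-word, case-sensitive matches using word boundaries.
strictMarch <- sum(grepl("\\bMarch\\b", sampleLines, perl = TRUE))
strictMay   <- sum(grepl("\\bMay\\b", sampleLines, perl = TRUE))

print(c(looseMar = looseMar, strictMarch = strictMarch, strictMay = strictMay))
```

Word boundaries do not remove every false positive: a sentence that starts with "May I" would still be counted as the month.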

Famous names

Who were the celebrities whose names appear most commonly?

f <- famousNamesTweetsCounts[famousNamesTweetsCounts>40]
barplot(f,
        main='Celebrity names - Tweets',
        xlab='Name',
        ylab='Occurrence',
        las=2)

Famous Names

Martin Luther appears prominently, with Donald Trump on a par with Tom Hanks and Bill Gates! Perhaps the period covered by this data set coincided with an event related to Martin Luther?

Famous places

What were the places whose names appear most commonly?

f <- famousPlacesTweetsCounts[famousPlacesTweetsCounts>40]
barplot(f,
        main='Places - Tweets',
        xlab='Place',
        ylab='Occurrence',
        las=2)

Famous Places

New York City and The White House feature prominently.

News data set

Let us explore the news data set.

library(NLP)
library(tm)
# Set the working directory and load the reference word lists
setwd("~/R/CAPS")
source('scripts/monthNames.R')
source('scripts/smileys.R')
source('scripts/famousNames.R')
source('scripts/famousPlaces.R')
# Read the English news file and extract its lines as a character vector
news <- Corpus(URISource('file://final/en_US/en_US.news.txt'))
newsItems <- news[[1]][1][[1]]
# For each smiley, find the news items that contain it, then count them
smileynewsItems <- sapply(smileys, function(i)
  grep(i, newsItems))
smileyNewsCounts <- sapply(smileynewsItems, length)
# Same for month names (case-sensitive)
monthnewsItems <- sapply(monthNames, function(i)
  grep(i, newsItems))
monthNamesNewsCounts <- sapply(monthnewsItems, length)
# Same for famous names (case-insensitive)
famousNamesnewsItems <- sapply(famousNames, function(i)
  grep(i, newsItems, ignore.case = TRUE))
famousNamesnewsItemsCounts <- sapply(famousNamesnewsItems, length)
# Same for famous places (case-insensitive)
famousPlacesnewsItems <- sapply(famousPlaces, function(i)
  grep(i, newsItems, ignore.case = TRUE))
famousPlacesnewsItemsCounts <- sapply(famousPlacesnewsItems, length)

Smileys

Let us plot the counts to see the ‘mood’ of the news items.

barplot(smileyNewsCounts,
        main='Mood - news',
        xlab='Emotion',
        ylab='Occurrence',
        las=2)

News mood

Given that news items are formal pieces, it is no surprise that smileys do not appear very frequently. In fact, the spike in the count for the Confused icon might be a ‘red herring’.
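One possible, unverified explanation for such a spike is that the search string also occurs inside URLs. The sketch below assumes, purely for illustration, that the Confused pattern is ":/"; the actual entry in smileys.R is not shown in this document, and the sample lines are made up.

```r
# Made-up sample lines, for illustration only.
sampleNews <- c("Details at http://example.com/story",
                "The vote was close :/",
                "No links here.")

# Assumed pattern; the real entry in smileys.R may differ.
confusedPattern <- ":/"

# Literal match on the raw text: the URL counts as a hit.
rawHits <- sum(grepl(confusedPattern, sampleNews, fixed = TRUE))

# Strip anything that looks like a URL, then match again.
noUrls <- gsub("(https?|ftp)://\\S+", "", sampleNews, perl = TRUE)
cleanHits <- sum(grepl(confusedPattern, noUrls, fixed = TRUE))

print(c(rawHits = rawHits, cleanHits = cleanHits))
```

If the counts on the real data drop sharply after removing URLs, the spike is an artefact of the matching rather than a sign of confused reporters.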

Month names

Do the news items refer to any particular month of the year more often than others?

barplot(monthNamesNewsCounts,
        main='Month Reference - News',
        xlab='Month',
        ylab='Occurrence',
        las=2)

Months

As with the tweets, the month of March appears with the highest frequency. Could it be because the search string also matches other words, such as the verb mar? The same concern applies to May, which is also a common verb.

Famous names

Who were the celebrities whose names appear most commonly?

f <- famousNamesnewsItemsCounts[famousNamesnewsItemsCounts>40]
barplot(f,
        main='Celebrity names - News',
        xlab='Name',
        ylab='Occurrence',
        las=2)

Famous Names

As in the case of Twitter, Martin Luther shows the highest frequency of occurrence. This probably has to do with the period over which this data set was sourced.

Famous places

What were the places whose names appear most commonly?

f <- famousPlacesnewsItemsCounts[famousPlacesnewsItemsCounts>40]
barplot(f,
        main='Places names - News',
        xlab='Place',
        ylab='Occurrence',
        las=2)

Famous Places

The places with the highest frequency are New York City and, interestingly, The White House too.

Blogs

Let us explore the blogs data set.

library(NLP)
library(tm)
# Set the working directory and load the reference word lists
setwd("~/R/CAPS")
source('scripts/monthNames.R')
source('scripts/smileys.R')
source('scripts/famousNames.R')
source('scripts/famousPlaces.R')
# Read the English blogs file and extract its lines as a character vector
blogs <- Corpus(URISource('file://final/en_US/en_US.blogs.txt'))
blogsItems <- blogs[[1]][1][[1]]
# For each smiley, find the blog items that contain it, then count them
smileyblogsItems <- sapply(smileys, function(i)
  grep(i, blogsItems))
smileyBlogsCounts <- sapply(smileyblogsItems, length)
# Same for month names (case-sensitive)
monthblogsItems <- sapply(monthNames, function(i)
  grep(i, blogsItems))
monthNamesBlogsCounts <- sapply(monthblogsItems, length)
# Same for famous names (case-insensitive)
famousNamesBlogsItems <- sapply(famousNames, function(i)
  grep(i, blogsItems, ignore.case = TRUE))
famousNamesBlogsItemsCounts <- sapply(famousNamesBlogsItems, length)
# Same for famous places (case-insensitive)
famousPlacesblogsItems <- sapply(famousPlaces, function(i)
  grep(i, blogsItems, ignore.case = TRUE))
famousPlacesblogsItemsCounts <- sapply(famousPlacesblogsItems, length)

Smileys

Let us plot the counts to see the ‘mood’ of the blog items.

barplot(smileyBlogsCounts,
        main='Blog mood',
        xlab='Emotion',
        ylab='Occurrence',
        las=2)

Blogs mood

As mentioned earlier in this document, a blog sits between tweets and news items in formality. Accordingly, the count of smile icons is greater than in the news items, but it is the count of the Confused icon that is the highest here.

Month names

Do the blog entries refer to any particular month of the year more often than others?

barplot(monthNamesBlogsCounts,
        main='Blog Month Reference',
        xlab='Month',
        ylab='Occurrence',
        las=2)

Months

Once more, the month of March appears with the highest frequency. Could it be because the search string also matches other words, such as the verb mar? The same concern applies to May, which is also a common verb.

Famous names

Who were the celebrities whose names appear most commonly?

f <- famousNamesBlogsItemsCounts[famousNamesBlogsItemsCounts>40]
barplot(f,
        main='Celebrity names - Blogs',
        xlab='Name',
        ylab='Occurrence',
        las=2)

Famous Names

Once again, Martin Luther King has the highest score for the famous names in blogs.

Famous places

What were the places whose names appear most commonly?

f <- famousPlacesblogsItemsCounts[famousPlacesblogsItemsCounts>40]
barplot(f,
        main='Places names - Blogs',
        xlab='Place',
        ylab='Occurrence',
        las=2)

Famous Places

The places of interest are New York followed by Singapore!

Summary of analysis

The occurrence of the smiley icons is consistently high in all data sets. This suggests that certain n-grams tend to be followed by a smiley icon.

Tokenization Modeling

Given that smileys are used to express emotion a number of times, can I build an algorithm to predict the emoticon from a given sequence of words? For example, if I replace each occurrence of the smiley icon with an identifiable text token such as |smiley_icon| and then apply tokenization, can I get n-grams suitable for prediction?
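A rough sketch of this idea in base R follows. It is not a finished model: the sample tweets, the helper makeNgrams and the plain whitespace tokenization are all my own assumptions, and only the ":)" smiley is handled.

```r
smileyToken <- "|smiley_icon|"

# Made-up sample tweets, for illustration only.
sampleTweets <- c("so happy today :)",
                  "love this song :) thanks")

# Replace the literal smiley with the placeholder token.
tagged <- gsub(":)", smileyToken, sampleTweets, fixed = TRUE)

# Naive whitespace tokenization, one character vector per tweet.
tokens <- strsplit(tagged, "\\s+")

# makeNgrams is a hypothetical helper that builds word n-grams for one tweet.
makeNgrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- unlist(lapply(tokens, makeNgrams, n = 2))

# Keep only the bigrams whose second word is the smiley token.
smileyBigrams <- bigrams[endsWith(bigrams, smileyToken)]
print(smileyBigrams)
```

The words that precede the token are the candidates for predicting a smiley. Note that any cleaning step that strips punctuation would also strip the pipes and the underscore from the token, so the replacement would have to survive that step or use a purely alphabetic token.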

Nagesh Subrahmanyam

4 September 2016