For this Capstone project, this document explains the exploration task for the data provided. The data is a collection of news, blog and Twitter text in four languages: English, German, Finnish and Russian. For this project, we will focus only on the English data set. This document explores the data sets to arrive at an understanding of the corpus.
Most examples of corpus exploration treat all the documents in a directory as one ‘set’. However, I have chosen to treat each document (news, blogs and Twitter) as an individual corpus. This decision is chiefly driven by the fact that each data source is unique in terms of authorship and writing style. For example, the Twitter feed is extremely informal because it is personal. The limit on the number of characters in a tweet also imposes a certain restriction on how the medium is used. At the other end of the spectrum is the news feed, where it is fair to assume that the text will be formal, grammatically accurate and produced by staff trained in this field. A blog entry probably sits midway in terms of authorship and formality.
It is quite certain that this strategy will have ramifications in the later stages of this project.
To begin with, for each corpus (Twitter, news and blogs), we will start with basic statistics. Then, we will look for features that hint at the human elements the text refers to. For example, what does the presence of various smiley icons in the Twitter corpus imply? Overly optimistic? Gloomy? Or, does the news corpus mention one country more often than others? How about celebrities? Do the blogs refer to certain places of international significance more than others?
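As a minimal sketch of what the basic statistics could look like, the snippet below computes line, word and length figures in base R. The sample lines and the names `sampleLines` and `basicStats` are assumptions for illustration; with the real data the input would come from reading one of the corpus files.

```r
# Hypothetical sample standing in for one corpus file; with the real data
# this would be the character vector of lines read from that file.
sampleLines <- c("Happy Monday everyone :)",
                 "See you in March",
                 "ok")

# Split each line on runs of whitespace and count the resulting tokens.
wordsPerLine <- lengths(strsplit(sampleLines, "\\s+"))

basicStats <- c(lines       = length(sampleLines),
                words       = sum(wordsPerLine),
                longestLine = max(nchar(sampleLines)),
                meanWords   = mean(wordsPerLine))
print(basicStats)
```

Splitting on whitespace is a rough word count; it treats a smiley such as `:)` as a word, which is acceptable for a first look at corpus size.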
The strategy behind developing the code for this exploration is based on three simple steps. Looking at the code, you can see that extending this analysis only requires a list of words that can be plugged in as search strings to get the counts.
Finally, the counts will be plotted to understand the trends.
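The idea of plugging in a word list and getting counts back can be sketched as a small helper. `countMatches`, `sampleText` and `places` are hypothetical names that do not appear in the project scripts; the helper only wraps the same `grep` and `length` calls used throughout this document.

```r
# countMatches is a hypothetical helper, not part of the original scripts.
# For each search term it returns the number of lines that contain the term.
# Extra arguments (for example ignore.case = TRUE) are passed on to grep().
countMatches <- function(terms, text, ...) {
  vapply(terms,
         function(term) length(grep(term, text, ...)),
         integer(1))
}

sampleText <- c("Flying to Paris in June", "Paris is lovely", "Stuck in london")
places <- c("Paris", "London")   # any word list can be plugged in here

placeCounts <- countMatches(places, sampleText, ignore.case = TRUE)
print(placeCounts)
```

The result is a named integer vector, so it can be passed directly to `barplot()` for the plotting step.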
The corpora used were as follows:
First, we load the Twitter data set for the English language. Then, we count the lines that contain smileys, names of months, places and celebrities.
library(NLP)
library(tm)
# Set the project directory and load the word lists used as search strings
setwd("~/R/CAPS")
source('scripts/monthNames.R')
source('scripts/smileys.R')
source('scripts/famousNames.R')
source('scripts/famousPlaces.R')
# Read the English Twitter file and extract its text
twitter <- Corpus(URISource('file://final/en_US/en_US.twitter.txt'))
tweets <- twitter[[1]][1][[1]]
t <- tweets
# Smileys: lines matching each symbol, then the count per symbol
smileyTweets <- sapply(smileys, function(i)
  grep(i, t))
smileyTweetsCounts <- sapply(smileyTweets, length)
# Month names (case-sensitive match)
monthTweets <- sapply(monthNames, function(i)
  grep(i, t))
monthNamesCounts <- sapply(monthTweets, length)
# Celebrity names (case-insensitive match)
famousNamesTweets <- sapply(famousNames, function(i)
  grep(i, t, ignore.case = TRUE))
famousNamesTweetsCounts <- sapply(famousNamesTweets, length)
# Place names (case-insensitive match)
famousPlacesTweets <- sapply(famousPlaces, function(i)
  grep(i, t, ignore.case = TRUE))
famousPlacesTweetsCounts <- sapply(famousPlacesTweets, length)
Let us plot the counts to see the ‘mood’ of the ‘Twitterati’.
barplot(smileyTweetsCounts,
main='Mood - Twitter',
xlab='Emotion',
ylab='Occurrence',
las=2)
Twitter mood
A very happy community, with winks and confusion as distant second emotions. Even allowing for the fact that not all symbols were searched for, and that symbols could also have been typed in right-to-left form, the smile symbol appears far more often than the others.
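One way to account for right-to-left forms is to search for both spellings of each smiley and count a line once if either appears. The sketch below is an assumption about how that could be done: `smileyForms` is a hypothetical list (the real one lives in `scripts/smileys.R` and may differ), and `fixed = TRUE` is used so that the parentheses are treated as literal characters rather than regular-expression syntax.

```r
# Hypothetical smiley list; the real one in scripts/smileys.R may differ.
# Each emotion maps to a left-to-right and a right-to-left spelling.
smileyForms <- list(Smile = c(":)", "(:"),
                    Wink  = c(";)", "(;"))

sampleTweets <- c("great day :)", "(: morning", "nice one ;)", "no emoticon")

# fixed = TRUE treats ":)" as literal text, so the parenthesis needs no escaping.
countEmotion <- function(forms, text) {
  hits <- lapply(forms, function(f) grepl(f, text, fixed = TRUE))
  sum(Reduce(`|`, hits))   # a line counts once even if it has both forms
}

moodCounts <- vapply(smileyForms, countEmotion, integer(1), text = sampleTweets)
print(moodCounts)
```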
Was there a reference to any particular month of the year by the community?
barplot(monthNamesCounts,
main='Months - Tweets',
xlab='Month',
ylab='Occurrence',
las=2)
Months
For some reason, the month of March appears with the highest frequency. Could it be because the word 'mar' also has a different meaning, just as the word 'may' does?
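Whether the March count is inflated depends on the search strings in `scripts/monthNames.R`, which are not shown here. As a hedged illustration, the sketch below contrasts a bare stem such as `mar` with a whole-word, case-sensitive pattern built from the `\\b` word-boundary anchor. The sample sentences are made up for the example.

```r
sampleText <- c("See you in March",
                "the farmers market opens early",
                "a remarkable goal",
                "It may rain",
                "Due in May")

# A lower-case stem such as "mar" matches inside unrelated words.
stemHits <- grep("mar", sampleText)

# Whole-word, case-sensitive match on the capitalised month name.
marchHits <- grep("\\bMarch\\b", sampleText, perl = TRUE)
mayHits   <- grep("\\bMay\\b", sampleText, perl = TRUE)

print(list(stem = stemHits, march = marchHits, may = mayHits))
```

Word boundaries remove substring matches, but they cannot separate the month from other capitalised uses, for example "March" as a verb or "May" as the first word of a sentence.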
Who were the celebrities whose names appear most commonly?
f <- famousNamesTweetsCounts[famousNamesTweetsCounts>40]
barplot(f,
main='Celebrity names - Tweets',
xlab='Name',
ylab='Occurrence',
las=2)
Famous Names
Martin Luther appears prominently, with Donald Trump on par with Tom Hanks and Bill Gates! Maybe the timeline of this data set coincided with an event related to Martin Luther?
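If the name list contains "Martin Luther" as a search string, it will also match every mention of "Martin Luther King", which may explain part of the high count. The sketch below shows one way to separate the two with a negative lookahead; the sample sentences are invented, and the lookahead syntax requires `perl = TRUE`.

```r
sampleText <- c("Quoting Martin Luther King today",
                "Martin Luther and the Reformation",
                "MLK day: Martin Luther King Jr.")

# Plain substring search: matches both people.
anyLuther  <- grepl("Martin Luther", sampleText, ignore.case = TRUE)
kingOnly   <- grepl("Martin Luther King", sampleText, ignore.case = TRUE)
# Negative lookahead (needs perl = TRUE): "Martin Luther" not followed by " King".
lutherOnly <- grepl("Martin Luther(?! King)", sampleText,
                    ignore.case = TRUE, perl = TRUE)

print(c(any = sum(anyLuther), king = sum(kingOnly), luther = sum(lutherOnly)))
```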
What were the places whose names appear most commonly?
f <- famousPlacesTweetsCounts[famousPlacesTweetsCounts>40]
barplot(f,
main='Places - Tweets',
xlab='Place',
ylab='Occurrence',
las=2)
Famous Places
New York City and The White House feature prominently.
Let us explore the news dataset.
library(NLP)
library(tm)
# Set the project directory and load the word lists used as search strings
setwd("~/R/CAPS")
source('scripts/monthNames.R')
source('scripts/smileys.R')
source('scripts/famousNames.R')
source('scripts/famousPlaces.R')
# Read the English news file and extract its text
news <- Corpus(URISource('file://final/en_US/en_US.news.txt'))
newsItems <- news[[1]][1][[1]]
t <- newsItems
# Smileys: lines matching each symbol, then the count per symbol
smileynewsItems <- sapply(smileys, function(i)
  grep(i, t))
smileyNewsCounts <- sapply(smileynewsItems, length)
# Month names (case-sensitive match)
monthnewsItems <- sapply(monthNames, function(i)
  grep(i, t))
monthNamesNewsCounts <- sapply(monthnewsItems, length)
# Celebrity names (case-insensitive match)
famousNamesnewsItems <- sapply(famousNames, function(i)
  grep(i, t, ignore.case = TRUE))
famousNamesnewsItemsCounts <- sapply(famousNamesnewsItems, length)
# Place names (case-insensitive match)
famousPlacesnewsItems <- sapply(famousPlaces, function(i)
  grep(i, t, ignore.case = TRUE))
famousPlacesnewsItemsCounts <- sapply(famousPlacesnewsItems, length)
Let us plot the counts to see the ‘mood’ of the news items.
barplot(smileyNewsCounts,
main='Mood - news',
xlab='Emotion',
ylab='Occurrence',
las=2)
News mood
Given that news items are formal pieces, it is no surprise that smileys do not appear very frequently. In fact, the spike in the count for the Confused icon might be a ‘red herring’.
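One possible cause of such a red herring, assuming the Confused icon is searched for as `:/` (the actual symbol in `scripts/smileys.R` is not shown here), is that the same two characters occur in every URL. The sketch below shows the effect on invented sample lines and one way to exclude the `://` of a URL with a negative lookahead.

```r
sampleNews <- c("Details at http://example.com/story",
                "Not sure about this :/",
                "Plain sentence with no symbols")

# Literal search: the URL on line 1 also contains ":/".
naiveHits <- grep(":/", sampleNews, fixed = TRUE)

# Require that ":/" is not followed by another "/", which excludes "://".
strictHits <- grep(":/(?!/)", sampleNews, perl = TRUE)

print(list(naive = naiveHits, strict = strictHits))
```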
Was there a reference to any particular month of the year by the community?
barplot(monthNamesNewsCounts,
main='Month Reference - News',
xlab='Month',
ylab='Occurrence',
las=2)
Months
For some reason, the month of March appears with the highest frequency. Could it be because the word 'mar' also has a different meaning, just as the word 'may' does?
Who were the celebrities whose names appear most commonly?
f <- famousNamesnewsItemsCounts[famousNamesnewsItemsCounts>40]
barplot(f,
main='Celebrity names - News',
xlab='Name',
ylab='Occurrence',
las=2)
Famous Names
As in the case of Twitter, Martin Luther shows the highest frequency of occurrence. This probably has to do with the timeline over which this data set was sourced.
What were the places whose names appear most commonly?
f <- famousPlacesnewsItemsCounts[famousPlacesnewsItemsCounts>40]
barplot(f,
main='Places names - News',
xlab='Place',
ylab='Occurrence',
las=2)
Famous Places
The places with the highest frequency are New York City and, interestingly, The White House too.
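The Twitter and news sections repeat the same load-and-count steps, so they could be wrapped in one function. The sketch below is an assumption, not the code used for this report: `exploreFile` is a hypothetical name, it reads the file with base R `readLines` instead of the `tm` corpus, and it applies one `ignore.case` setting to every list, whereas the sections above match smileys and months case-sensitively.

```r
# exploreFile is a hypothetical helper; termLists is a named list of word lists.
exploreFile <- function(path, termLists, ignore.case = TRUE) {
  text <- readLines(path, encoding = "UTF-8", warn = FALSE)
  lapply(termLists, function(terms) {
    vapply(terms,
           function(term) length(grep(term, text, ignore.case = ignore.case)),
           integer(1))
  })
}

# Tiny stand-in file so the sketch runs without the real data set.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Snow in New York in January",
             "Bill Gates visits new york",
             "Quiet day"), tmp)

result <- exploreFile(tmp, list(names  = c("Bill Gates", "Tom Hanks"),
                                places = c("New York")))
print(result)
```

With the real data, the same call would be repeated once per file, passing the word lists loaded from the scripts directory.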
Let us explore the blogs dataset.
library(NLP)
library(tm)
# Set the project directory and load the word lists used as search strings
setwd("~/R/CAPS")
source('scripts/monthNames.R')
source('scripts/smileys.R')
source('scripts/famousNames.R')
source('scripts/famousPlaces.R')
# Read the English blogs file and extract its text
blogs <- Corpus(URISource('file://final/en_US/en_US.blogs.txt'))
blogsItems <- blogs[[1]][1][[1]]
t <- blogsItems
# Smileys: lines matching each symbol, then the count per symbol
smileyblogsItems <- sapply(smileys, function(i)
  grep(i, t))
smileyBlogsCounts <- sapply(smileyblogsItems, length)
# Month names (case-sensitive match)
monthblogsItems <- sapply(monthNames, function(i)
  grep(i, t))
monthNamesBlogsCounts <- sapply(monthblogsItems, length)
# Celebrity names (case-insensitive match)
famousNamesBlogsItems <- sapply(famousNames, function(i)
  grep(i, t, ignore.case = TRUE))
famousNamesBlogsItemsCounts <- sapply(famousNamesBlogsItems, length)
# Place names (case-insensitive match)
famousPlacesblogsItems <- sapply(famousPlaces, function(i)
  grep(i, t, ignore.case = TRUE))
famousPlacesblogsItemsCounts <- sapply(famousPlacesblogsItems, length)
Let us plot the counts to see the ‘mood’ of the blog items.
barplot(smileyBlogsCounts,
main='Blog mood',
xlab='Emotion',
ylab='Occurrence',
las=2)
Blogs mood
As mentioned earlier in this document, a blog sits between tweets and news items in formality. Accordingly, the count of smile icons is greater than in the news items, but it is the Confused icon that has the highest count here.
Was there a reference to any particular month of the year by the community?
barplot(monthNamesBlogsCounts,
main='Blog Month Reference',
xlab='Month',
ylab='Occurrence',
las=2)
Months
For some reason, the month of March appears with the highest frequency. Could it be because the word 'mar' also has a different meaning, just as the word 'may' does?
Who were the celebrities whose names appear most commonly?
f <- famousNamesBlogsItemsCounts[famousNamesBlogsItemsCounts>40]
barplot(f,
main='Celebrity names - Blogs',
xlab='Name',
ylab='Occurrence',
las=2)
Famous Names
Once again, Martin Luther King has the highest score for the famous names in blogs.
What were the places whose names appear most commonly?
f <- famousPlacesblogsItemsCounts[famousPlacesblogsItemsCounts>40]
barplot(f,
main='Places names - Blogs',
xlab='Place',
ylab='Occurrence',
las=2)
Famous Places
The places of interest are New York followed by Singapore!
Smiley icons occur in all three data sets, though far less often in the news items than in tweets and blogs. Clearly, there must be an n-gram scenario that results in appending a smiley icon.
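Raw counts are hard to compare across corpora of different sizes, so one option is to express them as hits per 1,000 lines. The numbers below are hypothetical and purely illustrative; they are not results from the data, and `smileCounts` and `lineCounts` are assumed names.

```r
# Hypothetical numbers purely for illustration; they are not results from the data.
smileCounts <- c(twitter = 500, news = 20, blogs = 90)
lineCounts  <- c(twitter = 10000, news = 4000, blogs = 3000)

# Hits per 1,000 lines makes corpora of different sizes comparable.
ratePerThousand <- 1000 * smileCounts / lineCounts[names(smileCounts)]
print(round(ratePerThousand, 1))
```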
Given that smileys are used to express emotion a number of times, can I build an algorithm to predict the emoticon from a given sequence of words? For example, if I replace each occurrence of the Smiley icon with an identifiable text token such as |smiley_icon|, and then apply tokenization, can I get n-grams suitable for prediction?
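A minimal sketch of this idea in base R follows. The smiley list, the sample tweets and the helper `makeNgrams` are assumptions for illustration; tokenization here is a simple split on whitespace, not a full tokenizer.

```r
# The smiley list and sample lines are assumptions for illustration.
smileys <- c(":)", ":-)")
sampleTweets <- c("thank you so much :)", "so much fun :-)")

# Replace each smiley with the placeholder token (longest symbol first).
tokenised <- sampleTweets
for (s in smileys[order(nchar(smileys), decreasing = TRUE)]) {
  tokenised <- gsub(s, "|smiley_icon|", tokenised, fixed = TRUE)
}

# Build n-grams from the whitespace-separated tokens of one line.
makeNgrams <- function(line, n) {
  tokens <- strsplit(line, "\\s+")[[1]]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- unlist(lapply(tokenised, makeNgrams, n = 2))
# Keep only the bigrams that end in the smiley token.
smileyBigrams <- bigrams[grepl("|smiley_icon|", bigrams, fixed = TRUE)]
print(smileyBigrams)
```

The words preceding the token in these n-grams are the candidates for predicting a smiley. Any cleaning step that strips punctuation would also strip the pipe characters, so the replacement token may need a punctuation-free form in practice.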