Synopsis

We performed an exploratory data analysis to understand the similarities and differences among three English corpora (twitter, news, and blogs). We wanted to answer questions such as “Should we combine the three of them to build a predictive model of the next word?” or “Should we use only one of them, assuming that our application will be oriented toward informal text input?”

In the following sections, we will see that unigrams and trigrams in the three corpora (twitter, news, and blogs) share some very common expressions but also differ considerably. In particular, twitter users post in a more informal style and use the first and second person more frequently. News text, by contrast, contains more formal sentences.

Exploratory Analysis

To begin with, let’s load some libraries that aid with the analysis. We will use Quanteda for building the corpora and tokenizing the n-grams.

The data

The data for this capstone was obtained from HC Corpora and consists of a collection of publicly available sources gathered by a web crawler. The crawler checks the language of each source so that it mainly collects texts in the desired language. This capstone project involves working with 12 data files: 3 files for each of four languages, German, English, Finnish, and Russian (de_DE, en_US, fi_FI, and ru_RU respectively). Our exploratory analysis will be performed using only the English files. These are as follows:

en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

We then read the files using the readLines function and clean the text; some additional cleaning is applied to the blog entries.

When loading the data, we took random samples in order to work more quickly. We set a seed so that the results are reproducible.

10% random sample from the twitter file. This sample contains 236014 tweets.
50% random sample from the news file. This sample contains 505121 news items.
10% random sample from the blog file. This sample contains 89928 blog entries.
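
The sampling step can be sketched in base R as follows. This is a minimal illustration on toy data rather than the code used for the report: the helper name sample_lines is hypothetical, and it resets the seed on every call, whereas the report sets the seed once before reading the files.

```r
# Hypothetical helper: draw a reproducible random fraction of the lines.
sample_lines <- function(lines, fraction, seed = 1) {
  set.seed(seed)                               # fix the RNG state for reproducibility
  n_keep <- floor(length(lines) * fraction)    # number of lines to keep
  lines[sample(length(lines), n_keep)]         # sample line indices without replacement
}

# Toy input standing in for one of the text files.
toy <- paste("line", 1:100)
first  <- sample_lines(toy, 0.10)
second <- sample_lines(toy, 0.10)

length(first)             # number of lines kept
identical(first, second)  # same seed, so the two samples match
```

Because the seed is fixed inside the helper, repeated calls on the same input return the same subset, which is the property the report relies on.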

Let’s see a summary of the lengths (in characters) of the different texts:

summary(str_length(blog)) %>% pander

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
0	45	151	224.3	320	13930

summary(str_length(news)) %>% pander

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
0	106	180	196	261	7586

summary(str_length(twitter)) %>% pander

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
2	35	61	65.87	96	145

We can see above that long entries are much more common among blog posts than among news items and, as expected, tweets.

temp=as.data.frame(head(twitter[str_length(twitter)>170], 10))
print(xtable(temp),type = "html", comment=F, include.rownames=FALSE)

Creating a corpus and a document-feature matrix for each type of text (twitter, news, and blog)

We created three different corpora, for the twitter, news, and blog data sets respectively. We did not remove stop words or apply stemming, because those elements can help us predict the next word. We also tokenized the corpora and obtained unigrams and trigrams.
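
To make the idea of a trigram concrete, here is a small base R sketch that builds n-grams joined by an underscore, the same separator that appears in the trigram output of this report. It is an illustration only: the report itself relies on Quanteda for tokenization, and make_ngrams is a hypothetical helper that splits on spaces and does nothing more sophisticated.

```r
# Hypothetical helper: build n-grams joined by "_" from one text.
make_ngrams <- function(text, n = 3) {
  tokens <- strsplit(tolower(text), " +")[[1]]   # split on runs of spaces
  tokens <- tokens[tokens != ""]                 # drop empty tokens
  if (length(tokens) < n) return(character(0))   # text too short for an n-gram
  starts <- seq_len(length(tokens) - n + 1)      # start index of each n-gram
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Toy texts (made-up examples, not taken from the corpora).
texts <- c("thanks for the follow", "thanks for the love", "i love you")
trigrams <- unlist(lapply(texts, make_ngrams, n = 3))

# Frequency of each trigram, most frequent first.
trigram_counts <- sort(table(trigrams), decreasing = TRUE)
trigram_counts
```

Stop words such as “for” and “the” are kept on purpose, in line with the decision described above.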

Unigrams:

Bag of words visualization

Let’s display a bag of words visualization for Twitter, news and blogs respectively:

Some findings from these unigrams:

We can see some differences between the different types of corpus. For example, words such as “you”, “your”, “i”, “i’m”, “me”, and “my” are more likely to occur in a tweet than in a blog post, and much less likely to occur in a news entry. That makes sense, since the twitter style is more informal, whereas news is usually written in more formal language and in the third person.
In the news corpus we can find some top words that are verbs typical of reporting, mostly in the past tense, such as “said”, “has”, and “was”.
In the twitter corpus the word “just” is one of the most frequent, but it does not appear among the top words in either the blog or the news corpus.
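
One simple way to quantify the first finding is to compute the share of tokens that are first- or second-person words in each corpus. The sketch below uses made-up toy texts and a hypothetical helper, word_share; it is not the computation behind the word clouds, only an illustration of the comparison.

```r
# Toy texts standing in for two corpora (made-up examples).
tweets <- c("i love my new phone", "thank you for the follow")
news   <- c("the mayor said the budget was approved", "officials said he has resigned")

# First- and second-person words mentioned in the findings above.
pronouns <- c("i", "me", "my", "you", "your", "i'm")

# Hypothetical helper: share of tokens that belong to a given word list.
word_share <- function(texts, words) {
  tokens <- unlist(strsplit(tolower(texts), " +"))  # naive split on spaces
  mean(tokens %in% words)                           # proportion of matching tokens
}

word_share(tweets, pronouns)
word_share(news, pronouns)
```

On real data the same ratio could be computed from the unigram counts of each corpus instead of re-tokenizing the text.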

Relationship between some top words

Now we will select some frequent unigrams and compare how they relate to other words, using the similarity function provided by Quanteda.

## i'm :

## said :

## love :

We can see that the relationships of some common words such as “i’m”, “love”, and “said” differ considerably depending on the style of the corpus.
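
The comparison above uses a correlation between features (words) across documents. As a rough sketch of that idea, and assuming plain Pearson correlation between the count columns of a document-feature matrix, the following base R code ranks the words most correlated with a target word. It uses cor from base R on a toy matrix instead of Quanteda, and feature_correlation is a hypothetical helper.

```r
# Toy document-feature matrix: rows are documents, columns are word counts.
dtm <- matrix(c(1, 1, 0,
                2, 2, 0,
                0, 0, 1,
                0, 1, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4), c("love", "you", "said")))

# Hypothetical helper: correlation of one feature with every other feature.
feature_correlation <- function(m, word) {
  cors <- cor(m[, word], m)[1, ]                      # Pearson correlation with each column
  sort(cors[names(cors) != word], decreasing = TRUE)  # drop the word itself, rank the rest
}

feature_correlation(dtm, "love")
```

In this toy matrix “you” tends to appear in the same documents as “love”, so it ranks first, while “said” appears in different documents and gets a negative correlation.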

To finish our exploratory analysis, let’s look at some top trigrams across all corpora and then see how much they differ when it comes to predicting the next word.

Trigrams:

## Twitter top trigrams:

##     thanks_for_the      thank_you_for looking_forward_to 
##               2376                879                873 
##         i_love_you     for_the_follow          i_want_to 
##                812                805                736 
##        going_to_be      can't_wait_to          i_need_to 
##                715                708                592 
##           i_have_a 
##                591

## News top trigrams:

##       one_of_the         a_lot_of       as_well_as       out_of_the 
##             7394             5830             3193             2824 
##      part_of_the according_to_the       the_end_of      some_of_the 
##             2813             2813             2791             2715 
##          to_be_a      going_to_be 
##             2686             2629

## Blogs top trigrams:

##  one_of_the    a_lot_of    it_was_a     to_be_a  as_well_as  out_of_the 
##        1512        1291         697         673         671         652 
## some_of_the  be_able_to   i_want_to  the_end_of 
##         644         639         632         622

We can see here that some trigrams, such as “one_of_the” and “a_lot_of”, are highly frequent in both the news and blog corpora, and others, such as “going_to_be” and “i_want_to”, appear among the top trigrams of two of the three corpora. In terms of trigrams, the blog and news corpora are closer to each other than either is to twitter. We can see again that the twitter style is a lot more informal and that the first and second person are used more frequently. Let’s look at a couple of these trigrams and see whether they produce different predictions depending on the style of the text.

Some differences in terms of predicting the next word

In this section we will explore some of these trigrams and look at the likelihood of the third word given the first two. For example, for the trigram “i love you” we will estimate the probability of predicting “you” given that the two previous words are “i love”, and compare that probability with the probabilities of other words following “i love”, across the three corpora.
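
The estimate described here is a relative frequency: the count of a trigram divided by the total count of all trigrams that share its first two words. A minimal base R sketch, using made-up counts and a hypothetical helper next_word_prob, looks like this:

```r
# Toy trigram counts (made-up numbers, not taken from the corpora).
toy_counts <- c(i_love_you = 6, i_love_it = 3, i_love_this = 1, i_want_to = 5)

# Hypothetical helper: estimate P(w3 | w1 w2) as
# count(w1_w2_w3) / sum of counts of all trigrams starting with w1_w2_.
next_word_prob <- function(counts, w1, w2) {
  prefix <- paste0(w1, "_", w2, "_")
  hits <- counts[startsWith(names(counts), prefix)]           # trigrams with this prefix
  probs <- hits / sum(hits)                                   # relative frequencies
  names(probs) <- substring(names(hits), nchar(prefix) + 1)   # keep only the third word
  sort(probs, decreasing = TRUE)
}

next_word_prob(toy_counts, "i", "love")
```

With these toy counts, “you” receives 6 of the 10 occurrences that follow “i love”, so it is the top prediction. Running the same helper on the counts of each corpus separately is what allows the predictions to be compared.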

Future Steps

Our next steps are to take the n-grams we have computed, along with bigrams and perhaps four-grams, and build a model that best predicts an unseen word given some text. We are implementing this by building matrices of our unigrams, bigrams, and trigrams that store the frequency of each n-gram in the training set. We will then build an algorithm that is able to assign some probability to unseen events, using one of the available smoothing methods or stupid backoff.
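
As an illustration of the stupid backoff option, the sketch below scores a candidate word from toy counts. It follows the usual formulation of the method: use the relative frequency of the longest matching n-gram, and each time the model backs off to a shorter context, multiply the score by a fixed factor (0.4 is the commonly used value). All counts and helper names are hypothetical, and the result is a score rather than a normalized probability.

```r
# Toy counts (made-up numbers); n-grams use "_" as separator.
unigrams <- c(i = 10, love = 6, you = 8, it = 4)
bigrams  <- c(i_love = 5, love_you = 4, love_it = 1)
trigrams <- c(i_love_you = 3)

# Hypothetical helper: look up a count, returning 0 for unseen n-grams.
count_of <- function(counts, key) {
  if (key %in% names(counts)) counts[[key]] else 0
}

# Stupid backoff score of `word` after the context words (w1, w2).
stupid_backoff <- function(w1, w2, word, alpha = 0.4) {
  tri <- count_of(trigrams, paste(w1, w2, word, sep = "_"))
  if (tri > 0) return(tri / count_of(bigrams, paste(w1, w2, sep = "_")))
  bi <- count_of(bigrams, paste(w2, word, sep = "_"))
  if (bi > 0) return(alpha * bi / count_of(unigrams, w2))    # one backoff step
  alpha^2 * count_of(unigrams, word) / sum(unigrams)         # two backoff steps
}

stupid_backoff("i", "love", "you")  # seen trigram: 3 / 5
stupid_backoff("i", "love", "it")   # unseen trigram, backs off to the bigram: 0.4 * 1 / 6
```

The appeal of this scheme for a small application is that it needs only the count tables and one constant, with no extra smoothing parameters to estimate.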

As we saw in our exploratory analysis, a twitter corpus can produce different predictions than a corpus that contains only news. We will therefore assume that our predictive model will be consumed by a twitter-like application, and we will use only the twitter corpus.

Since the application will run as a Shiny application, which has some constraints in terms of memory and processor resources, we will have to build our algorithm with those constraints in mind.

Appendix

Code

#install.packages("needs")
library(needs)
needs(dplyr, tidyr, stringr, lubridate, readr, ggplot2,
      MASS,
      pander, formattable, viridis, 
      SnowballC, tm, wordcloud, RColorBrewer, RWeka, xtable, quanteda, gridExtra)

# Opening the files
set.seed(1)

# Twitter:
con <- file("../input/en_US/en_US.twitter.txt", "r") 
data<-readLines(con, n=-1) ## Read the entire file
close(con) 

data<-gsub("â€™", "'", data)
data<-gsub("â€œ", "", data)
data<-gsub("â€", "", data)
data<-gsub("  ", " ", data)
data<-gsub("  ", " ", data)
data<-gsub("  ", " ", data)
data<-gsub(" & ", " and ", data)
data<-gsub(" an ", " and ", data)
data<-gsub("[^a-zA-Z0-9 '-]", "", data)

data=str_trim(data)

h<-sample(length(data), length(data)/10)
twitter=data[h]

# Blogs:
con <- file("../input/en_US/en_US.blogs.txt", "r") 
data<-readLines(con, n=-1) ## Read the entire file
close(con) 

data<-gsub("â€™", "'", data)
data<-gsub("â€œ", "", data)
data<-gsub("â€", "", data)
data<-gsub("  ", " ", data)
data<-gsub("  ", " ", data)
data<-gsub("  ", " ", data)
data<-gsub(" & ", " and ", data)
data<-gsub(" an ", " and ", data)
data<-gsub("[^a-zA-Z0-9 '-]", "", data)

data=str_trim(data)

h<-sample(length(data), length(data)/10)
blog=data[h]

#here we do some cleaning to the blogs data
blog<-gsub("â€™", "'", blog)
blog=str_trim(blog)

# News:
con <- file("../input/en_US/en_US.news.txt", "rb") 
data<-readLines(con, n=-1) ## Read the entire file
close(con) 

data<-gsub("â€™", "'", data)
data<-gsub("â€œ", "", data)
data<-gsub("â€", "", data)
data<-gsub("  ", " ", data)
data<-gsub("  ", " ", data)
data<-gsub("  ", " ", data)
data<-gsub(" & ", " and ", data)
data<-gsub(" an ", " and ", data)
data<-gsub("[^a-zA-Z0-9 '-]", "", data)

data=str_trim(data)

h<-sample(length(data), length(data)/2)
news=data[h]
data=NULL

corpusTw <- corpus(twitter)  # build the corpus
corpusNe <- corpus(news)  # build the corpus
corpusBl <- corpus(blog)  # build the corpus

#Constructing a document-frequency matrix for unigrams
dfmTw <- dfm(corpusTw, verbose=FALSE, removeTwitter=TRUE)
dfmNe <- dfm(corpusNe, verbose=FALSE, removeTwitter=TRUE)
dfmBl <- dfm(corpusBl, verbose=FALSE, removeTwitter=TRUE)

par(mfrow = c(1,3))

if (require(RColorBrewer))
    plot(dfmTw, max.words = 30, colors = brewer.pal(6, "Dark2"), scale = c(8, .5), main = "Twitter")
if (require(RColorBrewer))
    plot(dfmNe, max.words = 30, colors = brewer.pal(7, "Dark2"), scale = c(8, .5), main = "News")
if (require(RColorBrewer))
    plot(dfmBl, max.words = 30, colors = brewer.pal(8, "Dark2"), scale = c(8, .5), main = "Blog")

#List of words to see relationship:
wordsToSee=c("i'm", 'said', 'love')
simTw=similarity(dfmTw, selection=wordsToSee, n = NULL,
  margin = c("features"), method = "correlation", normalize = FALSE)
simNe=similarity(dfmNe, selection=wordsToSee, n = NULL,
  margin = c("features"), method = "correlation", normalize = FALSE)
simBl=similarity(dfmBl, selection=wordsToSee, n = NULL,
  margin = c("features"), method = "correlation", normalize = FALSE)

for(i in 1:length(wordsToSee)){
  word=wordsToSee[i]
  cat(word, ":\n")
  ##Twitter
  df=as.data.frame(head(simTw[[word]], 10));df$feature=names(head(simTw[[word]], 10))
  names(df)[1]<-'Cor'
  g1=ggplot(df, aes(x = reorder(feature, Cor), y = Cor)) +
  geom_bar(stat = "identity", fill="gray") + coord_flip() +
  ggtitle(paste0("Twitter. Top cor to '", word, "'"))
  
  ##News
  df=as.data.frame(head(simNe[[word]], 10));df$feature=names(head(simNe[[word]], 10))
  names(df)[1]<-'Cor'
  g2=ggplot(df, aes(x = reorder(feature, Cor), y = Cor)) +
  geom_bar(stat = "identity", fill="gray") + coord_flip() +
  ggtitle(paste0("News. Top cor to '", word, "'"))
  
  ##Blogs
  df=as.data.frame(head(simBl[[word]], 10));df$feature=names(head(simBl[[word]], 10))
  names(df)[1]<-'Cor'
  g3=ggplot(df, aes(x = reorder(feature, Cor), y = Cor)) +
  geom_bar(stat = "identity", fill="gray") + coord_flip() +
  ggtitle(paste0("Blog. Top cor to '", word, "'"))
  
  grid.arrange(g1, g2, g3, nrow=1, ncol=3)
}

Now we will see the top 10 trigrams for each type of corpus:

dfmTw <- dfm(corpusTw, ngrams = 3, removeTwitter=TRUE, verbose=FALSE)
dfmNe <- dfm(corpusNe, ngrams = 3, removeTwitter=TRUE, verbose=FALSE)
dfmBl <- dfm(corpusBl, ngrams = 3, removeTwitter=TRUE, verbose=FALSE)

cat("Twitter top trigrams:\n")
topfeatures(dfmTw, 10) 
cat("News top trigrams:\n")
topfeatures(dfmNe, 10) 
cat("Blogs top trigrams:\n")
topfeatures(dfmBl, 10)

#for_the_follow
#twitter. for_the_follow
h=(grepl('^for\\_the\\_', colnames(dfmTw)) & colSums(dfmTw)>=1)
n=sum(colSums(dfmTw)[h]) # normalize by the twitter counts (h indexes dfmTw columns)
wordFreq=sort(colSums(dfmTw)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmTw)[h], decreasing=T)),10)
df$name=rownames(df)

g1=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Twitter. Predicting 'for the'")) + labs(y="Probability",x="Trigram")

#News. for_the_follow
h=(grepl('^for\\_the\\_', colnames(dfmNe)) & colSums(dfmNe)>=1)
n=sum(colSums(dfmNe)[h]) # normalize by the news counts (h indexes dfmNe columns)
wordFreq=sort(colSums(dfmNe)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmNe)[h], decreasing=T)),10)
df$name=rownames(df)

g2=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("News. Predicting 'for the'")) + labs(y="Probability",x="Trigram")

#Blog. for_the_follow
h=(grepl('^for\\_the\\_', colnames(dfmBl)) & colSums(dfmBl)>=1)
n=sum(colSums(dfmBl)[h])
wordFreq=sort(colSums(dfmBl)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmBl)[h], decreasing=T)),10)
df$name=rownames(df)

g3=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Blog. Predicting 'for the'")) + labs(y="Probability",x="Trigram")

grid.arrange(g1, g2, g3, nrow=1, ncol=3)


#i_feel_like
#twitter. i_feel_like
h=(grepl('^i\\_feel\\_', colnames(dfmTw)) & colSums(dfmTw)>=1)
n=sum(colSums(dfmTw)[h]) # normalize by the twitter counts (h indexes dfmTw columns)
wordFreq=sort(colSums(dfmTw)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmTw)[h], decreasing=T)),10)
df$name=rownames(df)

g1=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Twitter. Predicting 'i feel'")) + labs(y="Probability",x="Trigram")

#News. i_feel_like
h=(grepl('^i\\_feel\\_', colnames(dfmNe)) & colSums(dfmNe)>=1)
n=sum(colSums(dfmNe)[h]) # normalize by the news counts (h indexes dfmNe columns)
wordFreq=sort(colSums(dfmNe)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmNe)[h], decreasing=T)),10)
df$name=rownames(df)

g2=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("News. Predicting 'i feel'")) + labs(y="Probability",x="Trigram")

#Blog. i_feel_like
h=(grepl('^i\\_feel\\_', colnames(dfmBl)) & colSums(dfmBl)>=1)
n=sum(colSums(dfmBl)[h])
wordFreq=sort(colSums(dfmBl)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmBl)[h], decreasing=T)),10)
df$name=rownames(df)

g3=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Blog. Predicting 'i feel'")) + labs(y="Probability",x="Trigram")

grid.arrange(g1, g2, g3, nrow=1, ncol=3)

#twitter. in_the_first
h=(grepl('^in\\_the\\_', colnames(dfmTw)) & colSums(dfmTw)>=1)
n=sum(colSums(dfmTw)[h]) # normalize by the twitter counts (h indexes dfmTw columns)
wordFreq=sort(colSums(dfmTw)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmTw)[h], decreasing=T)),10)
df$name=rownames(df)

g1=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Twitter. Predicting 'in the'")) + labs(y="Probability",x="Trigram")

#News. in_the_first
h=(grepl('^in\\_the\\_', colnames(dfmNe)) & colSums(dfmNe)>=1)
n=sum(colSums(dfmNe)[h]) # normalize by the news counts (h indexes dfmNe columns)
wordFreq=sort(colSums(dfmNe)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmNe)[h], decreasing=T)),10)
df$name=rownames(df)

g2=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("News. Predicting 'in the'")) + labs(y="Probability",x="Trigram")

#Blog. in_the_first
h=(grepl('^in\\_the\\_', colnames(dfmBl)) & colSums(dfmBl)>=1)
n=sum(colSums(dfmBl)[h])
wordFreq=sort(colSums(dfmBl)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmBl)[h], decreasing=T)),10)
df$name=rownames(df)

g3=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Blog. Predicting 'in the'")) + labs(y="Probability",x="Trigram")

grid.arrange(g1, g2, g3, nrow=1, ncol=3)



#twitter. i_love_you
h=(grepl('^i\\_love\\_', colnames(dfmTw)) & colSums(dfmTw)>=1)
n=sum(colSums(dfmTw)[h]) # normalize by the twitter counts (h indexes dfmTw columns)
wordFreq=sort(colSums(dfmTw)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmTw)[h], decreasing=T)),10)
df$name=rownames(df)

g1=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Twitter. Predicting 'i love'")) + labs(y="Probability",x="Trigram")

#News. i_love_you
h=(grepl('^i\\_love\\_', colnames(dfmNe)) & colSums(dfmNe)>=1)
n=sum(colSums(dfmNe)[h]) # normalize by the news counts (h indexes dfmNe columns)
wordFreq=sort(colSums(dfmNe)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmNe)[h], decreasing=T)),10)
df$name=rownames(df)

g2=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("News. Predicting 'i love'")) + labs(y="Probability",x="Trigram")

#Blog. i_love_you
h=(grepl('^i\\_love\\_', colnames(dfmBl)) & colSums(dfmBl)>=1)
n=sum(colSums(dfmBl)[h])
wordFreq=sort(colSums(dfmBl)[h], decreasing=T)

df=head(data.frame(count=sort(colSums(dfmBl)[h], decreasing=T)),10)
df$name=rownames(df)

g3=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Blog. Predicting 'i love'")) + labs(y="Probability",x="Trigram")

grid.arrange(g1, g2, g3, nrow=1, ncol=3)

Data Science Capstone Milestone Report - Predicting the Next Word

Matias Thayer

Jun 10, 2016