We performed an exploratory data analysis to understand the similarities and differences among three English corpora (twitter, news, and blogs), and to answer questions such as “Should we combine all three to build a predictive model of the next word?” or “Should we use only one of them, assuming that our application will be oriented more toward informal text input?”
In the following sections, we will see that across the three corpora (twitter, news, and blogs), unigrams and trigrams share some very common expressions but also differ considerably. In particular, twitter users post in a more informal style and use the first and second person more frequently. News, by contrast, contains more formal sentences.
To begin with, let’s load some libraries that help with the analysis. We will use Quanteda to build the corpora and tokenize the n-grams.
The data for this capstone was obtained from HC Corpora, and consists of a collection of publicly available sources gathered by a web crawler. The crawler checks for language, so as to mainly get texts in the desired language. This capstone project involves working with 12 data files: 3 files for each of four languages, German, English, Finnish, and Russian (de_DE, en_US, fi_FI, and ru_RU respectively). Our exploratory analysis uses only the English files: en_US.twitter.txt, en_US.blogs.txt, and en_US.news.txt.
Then we read the files using the readLines function and apply some cleaning, with an extra pass for the blog entries.
When loading the data, we took random samples in order to work more quickly (10% of the twitter and blog lines and 50% of the news lines). We set a seed so the results are reproducible.
Let’s see a summary of the lengths (in characters) of the different texts:
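The sampling step can be illustrated with a minimal base R sketch. The vector `lines` is a hypothetical stand-in for the lines read from one corpus file, not the real data.

```r
# Hypothetical stand-in for the lines read from one corpus file
lines <- paste("line", 1:1000)

set.seed(1)                                          # fix the RNG state
idx <- sample(length(lines), length(lines) / 10)     # indices of a 10% sample
sample_a <- lines[idx]

set.seed(1)                                          # same seed, same draw
sample_b <- lines[sample(length(lines), length(lines) / 10)]

identical(sample_a, sample_b)   # the two samples match
length(sample_a)                # 100 lines out of 1000
```

Resetting the seed before each draw is what makes the sample repeatable; without it, each run would pick a different subset of lines.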
summary(str_length(blog)) %>% pander
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 0 | 45 | 151 | 224.3 | 320 | 13930 |
summary(str_length(news)) %>% pander
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 0 | 106 | 180 | 196 | 261 | 7586 |
summary(str_length(twitter)) %>% pander
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 2 | 35 | 61 | 65.87 | 96 | 145 |
We can see above that long entries are much more common in blogs than in news and, of course, twitter: the longest blog entry in the sample has 13,930 characters, against 7,586 for news and 145 for twitter.
temp=as.data.frame(head(twitter[str_length(twitter)>170], 10))
print(xtable(temp),type = "html", comment=F, include.rownames=FALSE)
We created three different corpora, for the twitter, news, and blog data sets respectively. We did not remove stop words or apply stemming, because those elements can help us predict the next word. We also tokenized the corpora and obtained unigrams and trigrams.
Let’s display a word cloud of the most frequent words for Twitter, news and blogs respectively:
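To make the tokenization step concrete, here is a small base R sketch of how unigrams and trigrams can be derived from a sentence while keeping stop words. The helper `make_ngrams` is a hypothetical name introduced for illustration; the analysis itself relies on Quanteda for this step.

```r
# Build n-grams from one sentence: lower-case, split on spaces, keep stop words
make_ngrams <- function(text, n) {
  tokens <- strsplit(tolower(text), " +")[[1]]
  tokens <- tokens[tokens != ""]                 # drop empty tokens
  if (length(tokens) < n) return(character(0))   # sentence too short for this n
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

make_ngrams("Thanks for the follow", 1)   # unigrams
make_ngrams("Thanks for the follow", 3)   # trigrams joined with "_"
```

Note that "for" and "the" survive in the output; removing them as stop words would destroy trigrams such as thanks_for_the.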
Some findings from these unigrams:
We can see some differences between the types of corpus. For example, words such as “you”, “your”, “i”, “i’m”, “me”, and “my” are more likely to occur in a tweet than in a blog post, and much less likely to occur in a news entry. That makes sense, since the twitter style is more informal, whereas news is usually written in more formal language and in the third person.
In the news corpus we can find some top words that are verbs typical of reporting past events, such as “said”, “has”, and “was”.
In the twitter corpus the word “just” is one of the most frequent, but it is not among the top words in either the blog or the news corpus.
Now, we will select some frequent unigrams and compare how they relate to other words using the similarity function provided by Quanteda.
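One way to check findings like these is to compare relative word frequencies between corpora. The sketch below uses base R and two tiny made-up text vectors (`toy_tweets` and `toy_news` are assumptions for illustration, not samples from the real data).

```r
# Share of all tokens in `texts` taken up by each of the given words
rel_freq <- function(texts, words) {
  tokens <- unlist(strsplit(tolower(texts), " +"))
  tokens <- tokens[tokens != ""]
  counts <- table(factor(tokens[tokens %in% words], levels = words))
  setNames(as.numeric(counts) / length(tokens), words)
}

# Made-up examples, for illustration only
toy_tweets <- c("i love you", "just got my coffee")
toy_news   <- c("the mayor said the city was ready", "officials said it was late")

pronouns <- c("i", "you", "my")
rel_freq(toy_tweets, pronouns)   # each pronoun is 1 of 7 tokens
rel_freq(toy_news, pronouns)     # none of them occur
```

Dividing by the total number of tokens matters because the corpora differ in size; raw counts alone would favor the largest sample.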
## i'm :
## said :
## love :
We can see that there are some important differences in the relationships of common words such as “i’m”, “love”, and “said”, depending on the type of corpus.
To conclude our exploratory analysis, let’s look at some top trigrams across all corpora and then see how differently they behave when predicting the next word.
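The idea behind a correlation-based word similarity can be sketched in base R: build a document-feature matrix and correlate the column of one word with the columns of all the others. The documents and the helper `top_correlated` are hypothetical, and this is only an approximation of what Quanteda computes internally.

```r
# Made-up documents, for illustration only
docs   <- c("i'm so happy today", "i'm so tired",
            "he said it was late", "she said it was fine")
tokens <- strsplit(docs, " +")
vocab  <- sort(unique(unlist(tokens)))

# Document-feature matrix: one row per document, one column per word
dfm_toy <- t(vapply(tokens,
                    function(tk) as.numeric(table(factor(tk, levels = vocab))),
                    numeric(length(vocab))))
colnames(dfm_toy) <- vocab

# Words whose per-document counts correlate most with those of `word`
top_correlated <- function(m, word, n = 3) {
  cors <- cor(m[, word], m[, colnames(m) != word])[1, ]
  head(sort(cors, decreasing = TRUE), n)
}

top_correlated(dfm_toy, "said")
```

Here “it” and “was” come out on top for “said” because they occur in exactly the same documents. On real corpora the ranking depends heavily on document length, which is one reason the results differ so much between tweets and news articles.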
## Twitter top trigrams:
## thanks_for_the thank_you_for looking_forward_to
## 2376 879 873
## i_love_you for_the_follow i_want_to
## 812 805 736
## going_to_be can't_wait_to i_need_to
## 715 708 592
## i_have_a
## 591
## News top trigrams:
## one_of_the a_lot_of as_well_as out_of_the
## 7394 5830 3193 2824
## part_of_the according_to_the the_end_of some_of_the
## 2813 2813 2791 2715
## to_be_a going_to_be
## 2686 2629
## Blogs top trigrams:
## one_of_the a_lot_of it_was_a to_be_a as_well_as out_of_the
## 1512 1291 697 673 671 652
## some_of_the be_able_to i_want_to the_end_of
## 644 639 632 622
We can see here that some trigrams are very frequent in more than one corpus. For example, “one_of_the” and “a_lot_of” lead both the news and the blog lists. In addition, the blog and news corpora are closer to each other in terms of trigrams than either is to twitter. We can see again that the twitter style is a lot more informal and that the first and second person are used more frequently. Let’s take a couple of these trigrams and see whether they produce different predictions depending on the style of the text.
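The closeness of the top-10 lists printed above can be quantified with a simple set overlap. The sketch below copies the three lists from that output and computes the Jaccard index (shared items divided by distinct items); the choice of Jaccard is ours, not part of the original analysis.

```r
# Top-10 trigrams copied from the output above
top_tw <- c("thanks_for_the", "thank_you_for", "looking_forward_to", "i_love_you",
            "for_the_follow", "i_want_to", "going_to_be", "can't_wait_to",
            "i_need_to", "i_have_a")
top_ne <- c("one_of_the", "a_lot_of", "as_well_as", "out_of_the", "part_of_the",
            "according_to_the", "the_end_of", "some_of_the", "to_be_a", "going_to_be")
top_bl <- c("one_of_the", "a_lot_of", "it_was_a", "to_be_a", "as_well_as",
            "out_of_the", "some_of_the", "be_able_to", "i_want_to", "the_end_of")

# Jaccard index: size of the intersection over size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

jaccard(top_ne, top_bl)   # 7 shared trigrams out of 13 distinct
jaccard(top_tw, top_ne)   # 1 shared trigram out of 19 distinct
jaccard(top_tw, top_bl)   # 1 shared trigram out of 19 distinct
```

News and blogs share seven of their ten top trigrams, while twitter shares only one with each of them.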
In this section we will explore some of these trigrams and look at the likelihood of the third word given the first two. For example, for the trigram “i love you” we estimate the probability of “you” given that the two previous words are “i love”, and compare it with the probabilities of the other trigrams that start with the same two words, across the three corpora.
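The estimate described here is the relative frequency of a trigram among all trigrams that share its first two words, counted within one corpus. A minimal base R sketch follows, with made-up counts and a hypothetical helper `next_word_prob`.

```r
# Made-up trigram counts for one corpus, for illustration only
tri_counts <- c(i_love_you = 6, i_love_it = 3, i_love_this = 1, i_want_you = 2)

# Estimated P(w3 | w1, w2) = count(w1_w2_w3) / sum of counts of all w1_w2_* trigrams
next_word_prob <- function(counts, w1, w2) {
  prefix <- paste0(w1, "_", w2, "_")
  hits   <- counts[startsWith(names(counts), prefix)]
  probs  <- hits / sum(hits)
  names(probs) <- substring(names(hits), nchar(prefix) + 1)   # keep only the third word
  sort(probs, decreasing = TRUE)
}

next_word_prob(tri_counts, "i", "love")   # you 0.6, it 0.3, this 0.1
```

The numerator and the denominator must come from the same corpus; otherwise the values no longer sum to one and cannot be read as probabilities.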
Our next steps are to take the n-grams we have computed, along with bigrams and perhaps four-grams, and build a model that best predicts the next word given some text. We are implementing this by building matrices of our unigrams, bigrams, and trigrams that store the frequency of each n-gram in the training set. We will then build an algorithm that can assign a probability to unseen events, using one of the available smoothing methods or stupid backoff.
As we saw in our exploratory analysis, a twitter corpus can produce different predictions than a corpus that contains only news. Therefore, we will assume that our predictive model will be used by a twitter-like application, and we will train on the twitter corpus only.
Since the model will be deployed as a Shiny application, which has constraints on memory and processing resources, we will have to design our algorithm with those limits in mind.
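As an illustration of the stupid backoff option, here is a minimal base R sketch over made-up count tables. It scores a candidate word with the trigram relative frequency when the trigram was seen, and otherwise backs off to the bigram and then the unigram, multiplying by a fixed factor at each step (0.4 is the value commonly used for this method). The tables and function names are hypothetical, and the result is a score rather than a normalized probability.

```r
# Made-up frequency tables, for illustration only
uni <- c(i = 10, love = 5, you = 6, it = 4, the = 8)
bi  <- c(i_love = 4, love_you = 3, love_it = 1)
tri <- c(i_love_you = 3, i_love_it = 1)

# Count lookup that returns 0 for unseen n-grams
get_count <- function(counts, key) if (key %in% names(counts)) counts[[key]] else 0

# Stupid backoff score of word `w` after the context `w1 w2`.
# Assumes the tables are consistent: a seen trigram implies a seen bigram context.
stupid_backoff <- function(w1, w2, w, alpha = 0.4) {
  c3 <- get_count(tri, paste(w1, w2, w, sep = "_"))
  if (c3 > 0) return(c3 / get_count(bi, paste(w1, w2, sep = "_")))
  c2 <- get_count(bi, paste(w2, w, sep = "_"))
  if (c2 > 0) return(alpha * c2 / get_count(uni, w2))
  alpha^2 * get_count(uni, w) / sum(uni)       # last resort: unigram frequency
}

stupid_backoff("i", "love", "you")     # trigram seen: 3 / 4
stupid_backoff("you", "love", "it")    # backs off to the bigram: 0.4 * 1 / 5
stupid_backoff("i", "love", "the")     # backs off to the unigram: 0.4^2 * 8 / 33
```

The appeal of this scheme for a small application is that it needs only count lookups and no discounting parameters estimated from held-out data.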
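One common way to respect such limits is to drop n-grams that occur very rarely before shipping the frequency tables. The sketch below is only an assumption about how this could be done; the threshold `min_count` and the simulated counts are hypothetical.

```r
set.seed(1)
# Simulated n-gram table: many entries seen once, fewer seen more often
counts <- c(rep(1, 9000), sample(2:500, 1000, replace = TRUE))
names(counts) <- paste0("ngram_", seq_along(counts))

min_count <- 2                          # assumed pruning threshold
pruned <- counts[counts >= min_count]   # keep n-grams seen at least twice

length(pruned)                                                       # 1000 entries remain
as.numeric(object.size(pruned)) / as.numeric(object.size(counts))    # fraction of memory kept
```

Pruning trades a little coverage of rare phrases for a much smaller table, and a backoff scheme can still produce a prediction for the contexts that were removed.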
#install.packages("needs")
library(needs)
needs(dplyr, tidyr, stringr, lubridate, readr, ggplot2,
MASS,
pander, formattable, viridis,
SnowballC, tm, wordcloud, RColorBrewer, RWeka, xtable, quanteda, gridExtra)
# Opening the files
set.seed(1)
# Twitter:
con <- file("../input/en_US/en_US.twitter.txt", "r")
data<-readLines(con, n=-1) ## Read the entire file
close(con)
data<-gsub("’", "'", data)
data<-gsub("“", "", data)
data<-gsub("â€", "", data)
data<-gsub(" +", " ", data) ## Collapse runs of spaces into a single space
data<-gsub(" & ", " and ", data)
data<-gsub(" an ", " and ", data) ## Caution: this also rewrites the article "an" as "and"
data<-gsub("[^a-zA-Z0-9 '-]", "", data)
data=str_trim(data)
h<-sample(length(data), length(data)/10)
twitter=data[h]
# Blogs:
con <- file("../input/en_US/en_US.blogs.txt", "r")
data<-readLines(con, n=-1) ## Read the entire file
close(con)
data<-gsub("’", "'", data)
data<-gsub("“", "", data)
data<-gsub("â€", "", data)
data<-gsub(" +", " ", data) ## Collapse runs of spaces into a single space
data<-gsub(" & ", " and ", data)
data<-gsub(" an ", " and ", data) ## Caution: this also rewrites the article "an" as "and"
data<-gsub("[^a-zA-Z0-9 '-]", "", data)
data=str_trim(data)
h<-sample(length(data), length(data)/10)
blog=data[h]
#here we do some cleaning to the blogs data
blog<-gsub("’", "'", blog)
blog=str_trim(blog)
# News:
con <- file("../input/en_US/en_US.news.txt", "rb")
data<-readLines(con, n=-1) ## Read the entire file
close(con)
data<-gsub("’", "'", data)
data<-gsub("“", "", data)
data<-gsub("â€", "", data)
data<-gsub(" +", " ", data) ## Collapse runs of spaces into a single space
data<-gsub(" & ", " and ", data)
data<-gsub(" an ", " and ", data) ## Caution: this also rewrites the article "an" as "and"
data<-gsub("[^a-zA-Z0-9 '-]", "", data)
data=str_trim(data)
h<-sample(length(data), length(data)/2)
news=data[h]
data=NULL
corpusTw <- corpus(twitter) # build the corpus
corpusNe <- corpus(news) # build the corpus
corpusBl <- corpus(blog) # build the corpus
#Constructing a document-frequency matrix for unigrams
dfmTw <- dfm(corpusTw, verbose=FALSE, removeTwitter=TRUE)
dfmNe <- dfm(corpusNe, verbose=FALSE, removeTwitter=TRUE)
dfmBl <- dfm(corpusBl, verbose=FALSE, removeTwitter=TRUE)
par(mfrow = c(1,3))
if (require(RColorBrewer))
plot(dfmTw, max.words = 30, colors = brewer.pal(6, "Dark2"), scale = c(8, .5), main = "Twitter")
if (require(RColorBrewer))
plot(dfmNe, max.words = 30, colors = brewer.pal(7, "Dark2"), scale = c(8, .5), main = "News")
if (require(RColorBrewer))
plot(dfmBl, max.words = 30, colors = brewer.pal(8, "Dark2"), scale = c(8, .5), main = "Blog")
#List of words to see relationship:
wordsToSee=c("i'm", 'said', 'love')
simTw=similarity(dfmTw, selection=wordsToSee, n = NULL,
margin = c("features"), method = "correlation", normalize = FALSE)
simNe=similarity(dfmNe, selection=wordsToSee, n = NULL,
margin = c("features"), method = "correlation", normalize = FALSE)
simBl=similarity(dfmBl, selection=wordsToSee, n = NULL,
margin = c("features"), method = "correlation", normalize = FALSE)
for(i in 1:length(wordsToSee)){
word=wordsToSee[i]
cat(word, ":\n")
##Twitter
df=as.data.frame(head(simTw[[word]], 10));df$feature=names(head(simTw[[word]], 10))
names(df)[1]<-'Cor'
g1=ggplot(df, aes(x = reorder(feature, Cor), y = Cor)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Twitter. Top cor to '", word, "'"))
##News
df=as.data.frame(head(simNe[[word]], 10));df$feature=names(head(simNe[[word]], 10))
names(df)[1]<-'Cor'
g2=ggplot(df, aes(x = reorder(feature, Cor), y = Cor)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("News. Top cor to '", word, "'"))
##Blogs
df=as.data.frame(head(simBl[[word]], 10));df$feature=names(head(simBl[[word]], 10))
names(df)[1]<-'Cor'
g3=ggplot(df, aes(x = reorder(feature, Cor), y = Cor)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Blog. Top cor to '", word, "'"))
grid.arrange(g1, g2, g3, nrow=1, ncol=3)
}
Now we will see the top 10 trigrams for each type of corpus:
dfmTw <- dfm(corpusTw, ngrams = 3, removeTwitter=TRUE, verbose=FALSE)
dfmNe <- dfm(corpusNe, ngrams = 3, removeTwitter=TRUE, verbose=FALSE)
dfmBl <- dfm(corpusBl, ngrams = 3, removeTwitter=TRUE, verbose=FALSE)
cat("Twitter top trigrams:\n")
topfeatures(dfmTw, 10)
cat("News top trigrams:\n")
topfeatures(dfmNe, 10)
cat("Blogs top trigrams:\n")
topfeatures(dfmBl, 10)
#for_the_follow
#twitter. for_the_follow
h=(grepl('^for\\_the\\_', colnames(dfmTw)) & colSums(dfmTw)>=1)
n=sum(colSums(dfmTw)[h]) # total count of matching trigrams in the Twitter corpus
wordFreq=sort(colSums(dfmTw)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmTw)[h], decreasing=T)),10)
df$name=rownames(df)
g1=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Twitter. Predicting 'for the'")) + labs(y="Probability",x="Trigram")
#News. for_the_follow
h=(grepl('^for\\_the\\_', colnames(dfmNe)) & colSums(dfmNe)>=1)
n=sum(colSums(dfmNe)[h]) # total count of matching trigrams in the news corpus
wordFreq=sort(colSums(dfmNe)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmNe)[h], decreasing=T)),10)
df$name=rownames(df)
g2=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("News. Predicting 'for the'")) + labs(y="Probability",x="Trigram")
#Blog. for_the_follow
h=(grepl('^for\\_the\\_', colnames(dfmBl)) & colSums(dfmBl)>=1)
n=sum(colSums(dfmBl)[h])
wordFreq=sort(colSums(dfmBl)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmBl)[h], decreasing=T)),10)
df$name=rownames(df)
g3=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Blog. Predicting 'for the'")) + labs(y="Probability",x="Trigram")
grid.arrange(g1, g2, g3, nrow=1, ncol=3)
#i_feel_like
#twitter. i_feel_like
h=(grepl('^i\\_feel\\_', colnames(dfmTw)) & colSums(dfmTw)>=1)
n=sum(colSums(dfmTw)[h]) # total count of matching trigrams in the Twitter corpus
wordFreq=sort(colSums(dfmTw)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmTw)[h], decreasing=T)),10)
df$name=rownames(df)
g1=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Twitter. Predicting 'i feel'")) + labs(y="Probability",x="Trigram")
#News. i_feel_like
h=(grepl('^i\\_feel\\_', colnames(dfmNe)) & colSums(dfmNe)>=1)
n=sum(colSums(dfmNe)[h]) # total count of matching trigrams in the news corpus
wordFreq=sort(colSums(dfmNe)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmNe)[h], decreasing=T)),10)
df$name=rownames(df)
g2=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("News. Predicting 'i feel'")) + labs(y="Probability",x="Trigram")
#Blog. i_feel_like
h=(grepl('^i\\_feel\\_', colnames(dfmBl)) & colSums(dfmBl)>=1)
n=sum(colSums(dfmBl)[h])
wordFreq=sort(colSums(dfmBl)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmBl)[h], decreasing=T)),10)
df$name=rownames(df)
g3=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Blog. Predicting 'i feel'")) + labs(y="Probability",x="Trigram")
grid.arrange(g1, g2, g3, nrow=1, ncol=3)
#twitter. in_the_first
h=(grepl('^in\\_the\\_', colnames(dfmTw)) & colSums(dfmTw)>=1)
n=sum(colSums(dfmTw)[h]) # total count of matching trigrams in the Twitter corpus
wordFreq=sort(colSums(dfmTw)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmTw)[h], decreasing=T)),10)
df$name=rownames(df)
g1=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Twitter. Predicting 'in the'")) + labs(y="Probability",x="Trigram")
#News. in_the_first
h=(grepl('^in\\_the\\_', colnames(dfmNe)) & colSums(dfmNe)>=1)
n=sum(colSums(dfmNe)[h]) # total count of matching trigrams in the news corpus
wordFreq=sort(colSums(dfmNe)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmNe)[h], decreasing=T)),10)
df$name=rownames(df)
g2=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("News. Predicting 'in the'")) + labs(y="Probability",x="Trigram")
#Blog. in_the_first
h=(grepl('^in\\_the\\_', colnames(dfmBl)) & colSums(dfmBl)>=1)
n=sum(colSums(dfmBl)[h])
wordFreq=sort(colSums(dfmBl)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmBl)[h], decreasing=T)),10)
df$name=rownames(df)
g3=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Blog. Predicting 'in the'")) + labs(y="Probability",x="Trigram")
grid.arrange(g1, g2, g3, nrow=1, ncol=3)
#twitter. i_love_you
h=(grepl('^i\\_love\\_', colnames(dfmTw)) & colSums(dfmTw)>=1)
n=sum(colSums(dfmTw)[h]) # total count of matching trigrams in the Twitter corpus
wordFreq=sort(colSums(dfmTw)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmTw)[h], decreasing=T)),10)
df$name=rownames(df)
g1=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Twitter. Predicting 'i love'")) + labs(y="Probability",x="Trigram")
#News. i_love_you
h=(grepl('^i\\_love\\_', colnames(dfmNe)) & colSums(dfmNe)>=1)
n=sum(colSums(dfmNe)[h]) # total count of matching trigrams in the news corpus
wordFreq=sort(colSums(dfmNe)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmNe)[h], decreasing=T)),10)
df$name=rownames(df)
g2=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("News. Predicting 'i love'")) + labs(y="Probability",x="Trigram")
#Blog. i_love_you
h=(grepl('^i\\_love\\_', colnames(dfmBl)) & colSums(dfmBl)>=1)
n=sum(colSums(dfmBl)[h])
wordFreq=sort(colSums(dfmBl)[h], decreasing=T)
df=head(data.frame(count=sort(colSums(dfmBl)[h], decreasing=T)),10)
df$name=rownames(df)
g3=ggplot(df, aes(x = reorder(name, count), y = count/n)) +
geom_bar(stat = "identity", fill="gray") + coord_flip() +
ggtitle(paste0("Blog. Predicting 'i love'")) + labs(y="Probability",x="Trigram")
grid.arrange(g1, g2, g3, nrow=1, ncol=3)