suppressMessages(library(quanteda))
suppressMessages(library(tokenizers))
suppressMessages(library(stopwords))
suppressMessages(library(kableExtra))
suppressMessages(library(tm))
suppressMessages(library(wordcloud))
suppressMessages(library(ngram))
This document considers three large data sets consisting of lines of text from Twitter, from the news and from blogs in the USA. We will see that the files are much too large to perform analysis on directly, so we first draw random subsamples that allow for further analysis. We use the ‘tm’ package to clean up the documents before analysis.
# Read in the data using readLines to get a character vector instead of a data frame. The vector works better when we need to use grep.
#skipNul=TRUE means embedded nul characters are skipped rather than causing warnings and truncated lines.
setwd("C:/Users/User/Desktop/Capstone")
us.twitter=readLines('en_US.twitter.txt',skipNul =TRUE,encoding ='UTF-8')
us.news=readLines('en_US.news.txt',skipNul=TRUE,encoding ='UTF-8')
us.blog=readLines('en_US.blogs.txt',skipNul=TRUE,encoding ='UTF-8')
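To see just how large these files are before reading them in full, their sizes on disk can be checked first (a quick sketch using base R's file.size(); the file names are the ones read above).
#Report each file's size in megabytes before committing to a full read.
round(file.size(c('en_US.twitter.txt','en_US.news.txt','en_US.blogs.txt'))/2^20,1)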
# Find the length of the longest entry (in characters) in each of the three data sets.
max.char.twitter=max(nchar(us.twitter))
max.char.news=max(nchar(us.news))
max.char.blogs=max(nchar(us.blog))
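If we also want to inspect the longest entry itself rather than just its length, which.max() on the character counts gives its index (a quick sketch on the Twitter data; the 80-character preview length is an arbitrary choice).
longest.idx=which.max(nchar(us.twitter)) #Index of the first longest tweet.
substr(us.twitter[longest.idx],1,80)     #Preview its first 80 characters.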
#g_love=length(grep("love",us.twitter)) #Number of tweets in the US Twitter data set containing 'love'.
#g_hate=length(grep("hate",us.twitter)) #Number of tweets in the US Twitter data set containing 'hate'.
#g_bio=which(grepl("biostats",us.twitter)) #The index where 'biostats' is mentioned in the us.twitter data set.
#g_chess=length(grep("A computer once beat me at chess, but it was no match for me at kickboxing",us.twitter))
#dup.twit=which(duplicated(us.twitter)) #Which tweets appear multiple times
#dup.news=which(duplicated(us.news))
#dup.blogs=which(duplicated(us.blog))
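If the exploratory counts above are uncommented, the ratio of tweets mentioning 'love' to tweets mentioning 'hate' follows directly (a small sketch built from the same grep counts).
g_love=length(grep("love",us.twitter)) #Tweets containing 'love'.
g_hate=length(grep("hate",us.twitter)) #Tweets containing 'hate'.
g_love/g_hate #Ratio of 'love' tweets to 'hate' tweets.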
We now take a look at each of the three text files and provide basic summaries of each. We make use of the ‘tokenizers’ package in R to count characters and words in each of our text files.
lines.twitter=length(us.twitter)
lines.news=length(us.news)
lines.blog=length(us.blog)
char.twitter=sum(count_characters(us.twitter))
char.news=sum(count_characters(us.news))
char.blog=sum(count_characters(us.blog))
words.twitter=sum(count_words(us.twitter))
words.news=sum(count_words(us.news))
words.blog=sum(count_words(us.blog))
row_names=c("Twitter","News","Blogs")
total_lines=c(lines.twitter,lines.news,lines.blog)
total_chars=c(char.twitter,char.news,char.blog)
total_words=c(words.twitter,words.news,words.blog)
total_matrix=data.frame(row_names,total_lines,total_chars,total_words)
names(total_matrix)=c("Source","Total Lines","Total Characters","Total Words")
kable(total_matrix,caption="Summary of full data sets") %>%
kable_styling('striped')
| Source | Total Lines | Total Characters | Total Words |
|---|---|---|---|
| Twitter | 2360148 | 162095982 | 30093414 |
| News | 77259 | 15639408 | 2674536 |
| Blogs | 899288 | 206824257 | 37546239 |
This kable table summarizes the total number of lines, characters and words in each data set; we can see how large these data sets are.
Because the total number of lines in the original data sets runs into the millions, we take a small random sample of each instead and perform the analysis on that sample.
#We want to sample without replacement; take a random sample of each data set.
us.twit.sub=sample(us.twitter,size=0.002*length(us.twitter),replace=FALSE)
us.twit.sub.replicate=us.twit.sub
us.news.sub=sample(us.news,size=0.04*length(us.news),replace=FALSE)
us.blog.sub=sample(us.blog,size=0.003*length(us.blog),replace=FALSE)
us.text=c(us.twit.sub,us.news.sub,us.blog.sub)
us.text=as.data.frame(us.text)
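Note that sample() draws a different subsample on each run of this document; if reproducibility matters, a random seed could be fixed before the sample() calls above (a minimal sketch; the seed value 1234 is an arbitrary choice).
set.seed(1234) #Fix the random number generator so the same subsamples are drawn every run.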
lines.twitter.sub=length(us.twit.sub)
lines.news.sub=length(us.news.sub)
lines.blog.sub=length(us.blog.sub)
char.twitter.sub=sum(nchar(us.twit.sub))
char.news.sub=sum(nchar(us.news.sub))
char.blog.sub=sum(nchar(us.blog.sub))
words.twitter.sub=sum(count_words(us.twit.sub))
words.news.sub=sum(count_words(us.news.sub))
words.blog.sub=sum(count_words(us.blog.sub))
row_names=c("Twitter","News","Blogs")
sub_lines=c(lines.twitter.sub,lines.news.sub,lines.blog.sub)
sub_chars=c(char.twitter.sub,char.news.sub,char.blog.sub)
sub_words=c(words.twitter.sub,words.news.sub,words.blog.sub)
sub_matrix=data.frame(row_names,sub_lines,sub_chars,sub_words)
names(sub_matrix)=c("Source","Total Lines","Total Characters","Total Words")
kable(sub_matrix,caption="Summary on reduced data set") %>%
kable_styling('striped')
| Source | Total Lines | Total Characters | Total Words |
|---|---|---|---|
| Twitter | 4720 | 328922 | 60875 |
| News | 3090 | 619595 | 105739 |
| Blogs | 2697 | 655104 | 118508 |
This summary describes the reduced data set, containing a random sample from each source. The sample sizes were chosen to be as large as possible while still leaving R enough memory to process the data.
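One way to gauge whether the combined sample fits comfortably in memory is to check its size directly (a quick sketch using base R's object.size() on the us.text data frame built above).
format(object.size(us.text),units="MB") #Approximate in-memory size of the combined subsample.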
The text files are still not very clean: there is extra whitespace, and there are emoticons and special characters. We use the “tm” package
(https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) to build our “corpus”. A corpus is a collection of texts, similar to a data set; the term is used when we are working specifically with text documents and word processing. The “tm” package allows us to clean our text files.
us.twit.sub=iconv(us.twit.sub, "UTF-8", "ASCII", sub="") #Remove icons/emoticons.
twitter_corpus=Corpus(VectorSource(us.twit.sub)) #Build the Corpus object
twitter_corpus=tm_map(twitter_corpus,content_transformer(tolower)) #Convert all text to lower case.
twitter_corpus=tm_map(twitter_corpus,removePunctuation) # Remove all punctuation.
twitter_corpus=tm_map(twitter_corpus,removeNumbers) #Remove numbers.
twitter_corpus=tm_map(twitter_corpus,stripWhitespace) #Remove blank white space last.
TDM.tw=TermDocumentMatrix(twitter_corpus)
twitterWords <- tail(sort(rowSums(as.matrix(TDM.tw))), 12)
twitter.cloudWords<-tail(sort(rowSums(as.matrix(TDM.tw))),50)
wordcloud(words=names(twitter.cloudWords),freq=twitter.cloudWords,colors=brewer.pal(8, "Dark2"))
barplot(twitterWords,main="Most frequent words in Twitter sample",col="blue",xlab="Word",ylab="Number of appearances",cex.names=0.8)
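The frequency plots are typically dominated by very common words such as “the” and “and”. If we wanted to highlight more distinctive terms, the ‘stopwords’ package loaded above could be used to drop English stop words before building the term-document matrix (an optional step, sketched here for the Twitter corpus; the *_nostop object names are my own).
#Optionally remove English stop words before recomputing the term-document matrix.
twitter_corpus_nostop=tm_map(twitter_corpus,removeWords,stopwords::stopwords("en"))
TDM.tw.nostop=TermDocumentMatrix(twitter_corpus_nostop)
tail(sort(rowSums(as.matrix(TDM.tw.nostop))),12) #Top 12 words with stop words removed.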
us.blog.sub=iconv(us.blog.sub, "UTF-8", "ASCII", sub="") #Remove icons/emoticons.
blogs_corpus=Corpus(VectorSource(us.blog.sub)) #Build the Corpus object
blogs_corpus=tm_map(blogs_corpus,content_transformer(tolower)) #Convert all text to lower case.
blogs_corpus=tm_map(blogs_corpus,removePunctuation) # Remove all punctuation.
blogs_corpus=tm_map(blogs_corpus,removeNumbers) #Remove numbers.
blogs_corpus=tm_map(blogs_corpus,stripWhitespace) #Remove blank white space last.
TDM.bg=TermDocumentMatrix(blogs_corpus)
blogWords<- tail(sort(rowSums(as.matrix(TDM.bg))), 12)
blog.cloudWords<-tail(sort(rowSums(as.matrix(TDM.bg))),50)
wordcloud(words=names(blog.cloudWords),freq=blog.cloudWords,colors=brewer.pal(7, "Set1"))
barplot(blogWords,main="Most frequent words in blogs sample",col='red',xlab='Word',ylab='Number of appearances',cex.names=0.8)
us.news.sub=iconv(us.news.sub, "UTF-8", "ASCII", sub="") #Remove icons/emoticons.
news_corpus=Corpus(VectorSource(us.news.sub)) #Build the Corpus object
news_corpus=tm_map(news_corpus,content_transformer(tolower)) #Convert all text to lower case.
news_corpus=tm_map(news_corpus,removePunctuation) # Remove all punctuation.
news_corpus=tm_map(news_corpus,removeNumbers) #Remove numbers.
news_corpus=tm_map(news_corpus,stripWhitespace) #Remove blank white space last.
TDM.nw=TermDocumentMatrix(news_corpus)
newsWords<- tail(sort(rowSums(as.matrix(TDM.nw))), 12)
news.cloudWords<-tail(sort(rowSums(as.matrix(TDM.nw))),50)
wordcloud(words=names(news.cloudWords),freq=news.cloudWords,colors=brewer.pal(6, "Set2"))
barplot(newsWords,main="Most frequent words in news sample",col='yellow',xlab='Word',ylab='Number of appearances',cex.names =0.8)
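The ‘quanteda’ package loaded at the top offers a more compact route to the same kind of word counts. As a rough cross-check, here is a sketch on the news subsample (tokens(), dfm() and topfeatures() are quanteda functions; the argument values are illustrative).
#Cross-check the news word frequencies with quanteda instead of tm.
news.dfm=dfm(tokens(us.news.sub,remove_punct=TRUE,remove_numbers=TRUE))
topfeatures(news.dfm,12) #Twelve most frequent features in the news subsample.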
We have now seen, for each data set, which words appear most often. However, we also want to know which words appear together most often. A unigram is a single word like those counted above, a bigram is a pair of words that appear together, and a trigram is a set of three words that appear together. To build these, we make use of the R package ‘ngram’.
twitter.tokens <- concatenate(lapply(twitter_corpus, "[", 1)) #This allows us to use the ngram package on tm data.
twitter.bigrams=ngram(twitter.tokens,n=2) #Form the twitter bigrams
twitter.trigrams=ngram(twitter.tokens,n=3) #Form the twitter trigrams
head(get.phrasetable(twitter.bigrams),n=10)
head(get.phrasetable(twitter.trigrams),n=10)
We can see that the most common bigram from the twitter subsample is “in the” , while the most common trigram is “thanks for the”.
blogs.tokens <- concatenate(lapply(blogs_corpus, "[", 1)) #This allows us to use the ngram package on tm data.
blogs.bigrams=ngram(blogs.tokens,n=2) #Form the blogs bigrams
blogs.trigrams=ngram(blogs.tokens,n=3) #Form the blogs trigrams
head(get.phrasetable(blogs.bigrams),n=10)
head(get.phrasetable(blogs.trigrams),n=10)
The most common bigram for the blog subsample is “of the”, while the most common trigram is “one of the”.
news.tokens <- concatenate(lapply(news_corpus, "[", 1)) #This allows us to use the ngram package on tm data.
news.bigrams=ngram(news.tokens,n=2) #Form the news bigrams
news.trigrams=ngram(news.tokens,n=3) #Form the news trigrams
head(get.phrasetable(news.bigrams),n=10)
head(get.phrasetable(news.trigrams),n=10)
The most common bigram for the news data subsample is “of the” while the most common trigram is “one of the”.
The next step will be to use our n-grams to predict the next word, given an initial set of words.
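As an illustration of what that next step could look like, here is a minimal sketch of a next-word predictor built from the phrase tables above; the function name predict.next.word and the simple back-off rule (try matching trigrams first, then bigrams) are my own assumptions, not the final model.
news.bi.table=get.phrasetable(news.bigrams)   #Phrase tables are sorted by frequency.
news.tri.table=get.phrasetable(news.trigrams)
predict.next.word=function(phrase,bi.table,tri.table){
  words=unlist(strsplit(tolower(trimws(phrase))," "))
  hits=data.frame()
  if(length(words)>=2){
    last.two=paste(tail(words,2),collapse=" ")
    #First try trigrams whose first two words match the end of the phrase.
    hits=tri.table[grepl(paste0("^",last.two," "),trimws(tri.table$ngrams)),]
  }
  if(nrow(hits)==0){
    #Back off to bigrams that start with the last word only.
    hits=bi.table[grepl(paste0("^",tail(words,1)," "),trimws(bi.table$ngrams)),]
  }
  if(nrow(hits)==0) return(NA)
  #The tables are frequency-sorted, so the first match is the best guess;
  #return the final word of that n-gram.
  tail(unlist(strsplit(trimws(hits$ngrams[1])," ")),1)
}
predict.next.word("one of",news.bi.table,news.tri.table) #Should return "the" given the trigram table above.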