suppressMessages(library(quanteda))
suppressMessages(library(tokenizers))
suppressMessages(library(stopwords))
suppressMessages(library(kableExtra))
suppressMessages(library(tm))
suppressMessages(library(wordcloud))
suppressMessages(library(ngram))
This document considers three large data sets consisting of lines of text from Twitter, from the news and from blogs in the USA. We will see that the files are much too large to perform analysis on directly, so we first draw random subsamples that allow for further analysis. We use the ‘tm’ package to clean up the documents before analysis.
# Read in the data using readLines to get a character vector instead of a data frame. The vector works better when we need to use grep.
#skipNul=TRUE means embedded nul characters are skipped rather than causing warnings and truncated lines.
setwd("C:/Users/User/Desktop/Capstone")
us.twitter=readLines('en_US.twitter.txt',skipNul =TRUE,encoding ='UTF-8')
us.news=readLines('en_US.news.txt',skipNul=TRUE,encoding ='UTF-8')
us.blog=readLines('en_US.blogs.txt',skipNul=TRUE,encoding ='UTF-8')
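To see just how large these files are before reading them in full, their sizes on disk can be checked first (a quick sketch using base R's file.size(); the file names are the ones read above).
#Report each file's size in megabytes before committing to a full read.
round(file.size(c('en_US.twitter.txt','en_US.news.txt','en_US.blogs.txt'))/2^20,1)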
# Find the length of the longest entry (in characters) in each of the three data sets.
max.char.twitter=max(nchar(us.twitter))
max.char.news=max(nchar(us.news))
max.char.blogs=max(nchar(us.blog))
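If we also want to inspect the longest entry itself rather than just its length, which.max() on the character counts gives its index (a quick sketch on the Twitter data; the 80-character preview length is an arbitrary choice).
longest.idx=which.max(nchar(us.twitter)) #Index of the first longest tweet.
substr(us.twitter[longest.idx],1,80)     #Preview its first 80 characters.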
#g_love=length(grep("love",us.twitter)) #Number of tweets in the US Twitter data set containing 'love'.
#g_hate=length(grep("hate",us.twitter)) #Number of tweets in the US Twitter data set containing 'hate'.
#g_bio=which(grepl("biostats",us.twitter)) #The index where 'biostats' is mentioned in the us.twitter data set.
#g_chess=length(grep("A computer once beat me at chess, but it was no match for me at kickboxing",us.twitter))
#dup.twit=which(duplicated(us.twitter)) #Which tweets appear multiple times
#dup.news=which(duplicated(us.news))
#dup.blogs=which(duplicated(us.blog))
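If the exploratory counts above are uncommented, the ratio of tweets mentioning 'love' to tweets mentioning 'hate' follows directly (a small sketch built from the same grep counts).
g_love=length(grep("love",us.twitter)) #Tweets containing 'love'.
g_hate=length(grep("hate",us.twitter)) #Tweets containing 'hate'.
g_love/g_hate #Ratio of 'love' tweets to 'hate' tweets.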
We now take a look at each of the three text files and provide basic summaries of each. We make use of the ‘tokenizers’ package in R to count characters and words in each of our text files.
lines.twitter=length(us.twitter)
lines.news=length(us.news)
lines.blog=length(us.blog)
char.twitter=sum(count_characters(us.twitter))
char.news=sum(count_characters(us.news))
char.blog=sum(count_characters(us.blog))
words.twitter=sum(count_words(us.twitter))
words.news=sum(count_words(us.news))
words.blog=sum(count_words(us.blog))
row_names=c("Twitter","News","Blogs")
total_lines=c(lines.twitter,lines.news,lines.blog)
total_chars=c(char.twitter,char.news,char.blog)
total_words=c(words.twitter,words.news,words.blog)
total_matrix=data.frame(row_names,total_lines,total_chars,total_words)
names(total_matrix)=c("Source","Total Lines","Total Characters","Total Words")
kable(total_matrix,caption="Summary of full data sets") %>%
kable_styling('striped')
| Source | Total Lines | Total Characters | Total Words |
|---|---|---|---|
| Twitter | 2360148 | 162095982 | 30093414 |
| News | 77259 | 15639408 | 2674536 |
| Blogs | 899288 | 206824257 | 37546239 |
This kable table summarizes the total number of lines, characters and words in each data set; we can see how large these data sets are.
Because the total number of lines in the original data sets runs into the millions, we take a small random sample of each instead and perform the analysis on that sample.
#We want to sample without replacement; take a random sample of each data set.
us.twit.sub=sample(us.twitter,size=0.002*length(us.twitter),replace=FALSE)
us.twit.sub.replicate=us.twit.sub
us.news.sub=sample(us.news,size=0.04*length(us.news),replace=FALSE)
us.blog.sub=sample(us.blog,size=0.003*length(us.blog),replace=FALSE)
us.text=c(us.twit.sub,us.news.sub,us.blog.sub)
us.text=as.data.frame(us.text)
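Note that sample() draws a different subsample on each run of this document; if reproducibility matters, a random seed could be fixed before the sample() calls above (a minimal sketch; the seed value 1234 is an arbitrary choice).
set.seed(1234) #Fix the random number generator so the same subsamples are drawn every run.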
lines.twitter.sub=length(us.twit.sub)
lines.news.sub=length(us.news.sub)
lines.blog.sub=length(us.blog.sub)
char.twitter.sub=sum(nchar(us.twit.sub))
char.news.sub=sum(nchar(us.news.sub))
char.blog.sub=sum(nchar(us.blog.sub))
words.twitter.sub=sum(count_words(us.twit.sub))
words.news.sub=sum(count_words(us.news.sub))
words.blog.sub=sum(count_words(us.blog.sub))
row_names=c("Twitter","News","Blogs")
sub_lines=c(lines.twitter.sub,lines.news.sub,lines.blog.sub)
sub_chars=c(char.twitter.sub,char.news.sub,char.blog.sub)
sub_words=c(words.twitter.sub,words.news.sub,words.blog.sub)
sub_matrix=data.frame(row_names,sub_lines,sub_chars,sub_words)
names(sub_matrix)=c("Source","Total Lines","Total Characters","Total Words")
kable(sub_matrix,caption="Summary on reduced data set") %>%
kable_styling('striped')
| Source | Total Lines | Total Characters | Total Words |
|---|---|---|---|
| Twitter | 4720 | 328922 | 60875 |
| News | 3090 | 619595 | 105739 |
| Blogs | 2697 | 655104 | 118508 |
This summary describes the reduced data set, containing a random sample from each source. The sample sizes were chosen to be as large as possible while still leaving R enough memory to process the data.
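One way to gauge whether the combined sample fits comfortably in memory is to check its size directly (a quick sketch using base R's object.size() on the us.text data frame built above).
format(object.size(us.text),units="MB") #Approximate in-memory size of the combined subsample.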
The text files are still not very clean: there is extra whitespace, and there are emoticons and special characters. We use the “tm” package
(https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) to build our “corpus”. A corpus is a collection of texts, similar to a data set; the term is used when we are working specifically with text documents and word processing. The “tm” package allows us to clean our text files.
us.twit.sub=iconv(us.twit.sub, "UTF-8", "ASCII", sub="") #Remove icons/emoticons.
twitter_corpus=Corpus(VectorSource(us.twit.sub)) #Build the Corpus object
twitter_corpus=tm_map(twitter_corpus,content_transformer(tolower)) #Convert all text to lower case.
twitter_corpus=tm_map(twitter_corpus,removePunctuation) # Remove all punctuation.
twitter_corpus=tm_map(twitter_corpus,removeNumbers) #Remove numbers.
twitter_corpus=tm_map(twitter_corpus,stripWhitespace) #Remove blank white space last.
TDM.tw=TermDocumentMatrix(twitter_corpus)
twitterWords <- tail(sort(rowSums(as.matrix(TDM.tw))), 12)
twitter.cloudWords<-tail(sort(rowSums(as.matrix(TDM.tw))),50)
wordcloud(words=names(twitter.cloudWords),freq=twitter.cloudWords,colors=brewer.pal(8, "Dark2"))
barplot(twitterWords,main="Most frequent words in Twitter sample",col="blue",xlab="Word",ylab="Number of appearances",cex.names=0.8)
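The frequency plots are typically dominated by very common words such as “the” and “and”. If we wanted to highlight more distinctive terms, the ‘stopwords’ package loaded above could be used to drop English stop words before building the term-document matrix (an optional step, sketched here for the Twitter corpus; the *_nostop object names are my own).
#Optionally remove English stop words before recomputing the term-document matrix.
twitter_corpus_nostop=tm_map(twitter_corpus,removeWords,stopwords::stopwords("en"))
TDM.tw.nostop=TermDocumentMatrix(twitter_corpus_nostop)
tail(sort(rowSums(as.matrix(TDM.tw.nostop))),12) #Top 12 words with stop words removed.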
us.blog.sub=iconv(us.blog.sub, "UTF-8", "ASCII", sub="") #Remove icons/emoticons.
blogs_corpus=Corpus(VectorSource(us.blog.sub)) #Build the Corpus object
blogs_corpus=tm_map(blogs_corpus,content_transformer(tolower)) #Convert all text to lower case.
blogs_corpus=tm_map(blogs_corpus,removePunctuation) # Remove all punctuation.
blogs_corpus=tm_map(blogs_corpus,removeNumbers) #Remove numbers.
blogs_corpus=tm_map(blogs_corpus,stripWhitespace) #Remove blank white space last.
TDM.bg=TermDocumentMatrix(blogs_corpus)
blogWords<- tail(sort(rowSums(as.matrix(TDM.bg))), 12)
blog.cloudWords<-tail(sort(rowSums(as.matrix(TDM.bg))),50)
wordcloud(words=names(blog.cloudWords),freq=blog.cloudWords,colors=brewer.pal(7, "Set1"))
barplot(blogWords,main="Most frequent words in blogs sample",col='red',xlab='Word',ylab='Number of appearances',cex.names=0.8)
us.news.sub=iconv(us.news.sub, "UTF-8", "ASCII", sub="") #Remove icons/emoticons.
news_corpus=Corpus(VectorSource(us.news.sub)) #Build the Corpus object
news_corpus=tm_map(news_corpus,content_transformer(tolower)) #Convert all text to lower case.
news_corpus=tm_map(news_corpus,removePunctuation) # Remove all punctuation.
news_corpus=tm_map(news_corpus,removeNumbers) #Remove numbers.
news_corpus=tm_map(news_corpus,stripWhitespace) #Remove blank white space last.
TDM.nw=TermDocumentMatrix(news_corpus)
newsWords<- tail(sort(rowSums(as.matrix(TDM.nw))), 12)
news.cloudWords<-tail(sort(rowSums(as.matrix(TDM.nw))),50)
wordcloud(words=names(news.cloudWords),freq=news.cloudWords,colors=brewer.pal(6, "Set2"))
barplot(newsWords,main="Most frequent words in news sample",col='yellow',xlab='Word',ylab='Number of appearances',cex.names =0.8)
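The ‘quanteda’ package loaded at the top offers a more compact route to the same kind of word counts. As a rough cross-check, here is a sketch on the news subsample (tokens(), dfm() and topfeatures() are quanteda functions; the argument values are illustrative).
#Cross-check the news word frequencies with quanteda instead of tm.
news.dfm=dfm(tokens(us.news.sub,remove_punct=TRUE,remove_numbers=TRUE))
topfeatures(news.dfm,12) #Twelve most frequent features in the news subsample.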
We have now seen, for each data set, which words appear most often. However, we also want to know which words appear together most often. A unigram is a single word like those counted above, a bigram is a pair of words that appear together, and a trigram is a set of three words that appear together. To build these, we make use of the R package ‘ngram’.
twitter.tokens <- concatenate(lapply(twitter_corpus, "[", 1)) #This allows us to use the ngram package on tm data.
twitter.bigrams=ngram(twitter.tokens,n=2) #Form the twitter bigrams
twitter.trigrams=ngram(twitter.tokens,n=3) #Form the twitter trigrams
head(get.phrasetable(twitter.bigrams),n=10)
head(get.phrasetable(twitter.trigrams),n=10)
We can see that the most common bigram from the twitter subsample is “in the” , while the most common trigram is “thanks for the”.
blogs.tokens <- concatenate(lapply(blogs_corpus, "[", 1)) #This allows us to use the ngram package on tm data.
blogs.bigrams=ngram(blogs.tokens,n=2) #Form the blogs bigrams
blogs.trigrams=ngram(blogs.tokens,n=3) #Form the blogs trigrams
head(get.phrasetable(blogs.bigrams),n=10)
head(get.phrasetable(blogs.trigrams),n=10)
The most common bigram for the blog subsample is “of the”, while the most common trigram is “one of the”.
news.tokens <- concatenate(lapply(news_corpus, "[", 1)) #This allows us to use the ngram package on tm data.
news.bigrams=ngram(news.tokens,n=2) #Form the news bigrams
news.trigrams=ngram(news.tokens,n=3) #Form the news trigrams
head(get.phrasetable(news.bigrams),n=10)
head(get.phrasetable(news.trigrams),n=10)
The most common bigram for the news data subsample is “of the” while the most common trigram is “one of the”.
The next step will be to use our n-grams to predict the next word, given an initial set of words.
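As an illustration of what that next step could look like, here is a minimal sketch of a next-word predictor built from the phrase tables above; the function name predict.next.word and the simple back-off rule (try matching trigrams first, then bigrams) are my own assumptions, not the final model.
news.bi.table=get.phrasetable(news.bigrams)   #Phrase tables are sorted by frequency.
news.tri.table=get.phrasetable(news.trigrams)
predict.next.word=function(phrase,bi.table,tri.table){
  words=unlist(strsplit(tolower(trimws(phrase))," "))
  hits=data.frame()
  if(length(words)>=2){
    last.two=paste(tail(words,2),collapse=" ")
    #First try trigrams whose first two words match the end of the phrase.
    hits=tri.table[grepl(paste0("^",last.two," "),trimws(tri.table$ngrams)),]
  }
  if(nrow(hits)==0){
    #Back off to bigrams that start with the last word only.
    hits=bi.table[grepl(paste0("^",tail(words,1)," "),trimws(bi.table$ngrams)),]
  }
  if(nrow(hits)==0) return(NA)
  #The tables are frequency-sorted, so the first match is the best guess;
  #return the final word of that n-gram.
  tail(unlist(strsplit(trimws(hits$ngrams[1])," ")),1)
}
predict.next.word("one of",news.bi.table,news.tri.table) #Should return "the" given the trigram table above.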