Synopsis

An exploratory analysis of US English text extracts from Twitter, news sites and blogs has been performed. The data were extracted in 2016 and the records span roughly the preceding 10 years. The exploratory analysis consisted of:

  1. Identifying the main characteristics of each data file, such as size, total number of text lines, length of the longest line and number of words.

  2. Developing 1-gram and 3-gram models. These help in better understanding the dataset and its most common words and expressions.

Data characterization

# Function to read a text file and return its full text plus some general stats.
fileInformation <- function(filepath) {
  size <- file.info(filepath)$size / 1048576   # file size in MB
  
  conn <- file(filepath, "r")
  fulltext <- readLines(conn, skipNul = TRUE)
  close(conn)
  
  nlines <- length(fulltext)
  maxline <- max(nchar(fulltext))   # length of the longest line
  
  infotext <- data.frame(file = basename(filepath),
                         size = sprintf("%.1f", size),
                         numberLines = nlines,
                         maxLine = maxline)
  
  return(list(fulltext, infotext))
}

twitter_info <- fileInformation("~/Coursera-SwiftKey/en_US/en_US.twitter.txt")
news_info <- fileInformation("~/Coursera-SwiftKey/en_US/en_US.news.txt")
blog_info <- fileInformation("~/Coursera-SwiftKey/en_US/en_US.blogs.txt")

basic_info <- rbind(twitter_info[[2]], news_info[[2]], blog_info[[2]])

basic_info

The main characteristics of the files are (size in MB):

##                file  size numberLines maxLine
## 1 en_US.twitter.txt 159.4     2360148     140
## 2    en_US.news.txt 196.3     1010242   11384
## 3   en_US.blogs.txt 200.4      899288   40833
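
Word counts, mentioned above, are not included in the table; a minimal sketch of how they could be approximated (splitting each line on whitespace, using the full text already returned by fileInformation) is:

# Sketch only: approximate number of words in a file by splitting lines on whitespace.
countWords <- function(fulltext) {
  sum(vapply(strsplit(fulltext, "\\s+"), length, integer(1)))
}
# e.g. countWords(twitter_info[[1]])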

N-gram model

The model used to predict words is based on assigning probabilities to sequences of words. Such models are called Language Models (LMs), and the simplest of them is the N-gram.

N-grams are an old approach to language modeling that, given their simplicity, have proved quite successful.
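
As a brief reminder of the idea behind such models (a sketch only, with hypothetical named count vectors trigram_counts and bigram_counts), the probability of a word is approximated from the preceding N-1 words and estimated from counts; for a trigram:

# Sketch only: maximum-likelihood estimate of the next word given the two preceding words.
# trigram_counts and bigram_counts are hypothetical named count vectors,
# e.g. trigram_counts["happy new year"] and bigram_counts["happy new"].
p_next_word <- function(w1, w2, w3, trigram_counts, bigram_counts) {
  unname(trigram_counts[paste(w1, w2, w3)] / bigram_counts[paste(w1, w2)])
}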

An N-gram model requires the tokenization of the data corpus. This is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. In the process of tokenization, some characters like punctuation marks are discarded.
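
As a small illustration of tokenization (made-up sentence, not taken from the corpus), a string can be lower-cased and split on anything that is not a letter or an apostrophe:

# Illustration only: simple word tokenization of a short string.
tokens <- unlist(strsplit(tolower("This, right here, is a sample sentence."), "[^a-z']+"))
tokens[nzchar(tokens)]   # "this" "right" "here" "is" "a" "sample" "sentence"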

The data corpus has been sampled and processed for tokenization.

library(tm)

twitter <- twitter_info[[1]]
news <- news_info[[1]]
blog <- blog_info[[1]]

# Sample about 3% of each source to keep data processing manageable.
set.seed(140420)
data.sample <- c(sample(twitter, round(length(twitter) * 0.03)),
                 sample(news, round(length(news) * 0.03)),
                 sample(blog, round(length(blog) * 0.03)))

# Create the corpus and clean the data.
corpus <- VCorpus(VectorSource(data.sample))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))  # keeps documents as PlainTextDocuments
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("en"))

# Term-document matrix of unigrams, keeping only words of at least 3 characters.
OnegramTokenizer <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, Inf)))

# To simplify the data further, sparse terms could be dropped:
# OnegramTokenizer <- removeSparseTerms(OnegramTokenizer, 0.75)

# Function to extract word frequencies from a Term-Document Matrix object
# (a sparse triplet matrix: i = term index, j = document index, v = count).
Freq <- function(x) {
  a <- aggregate(x$v, by = list(x$i), sum)   # total frequency of each term across documents
  colnames(a) <- c("wordIndex", "frequency")
  a <- a[order(a$frequency, decreasing = TRUE), ]
  a$words <- x$dimnames$Terms[a$wordIndex]
  return(a)
}

# Taking the words and their frequencies from the 1-gram object
OnegramData <- Freq(OnegramTokenizer)
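
# A quick check that could be added here (not part of the original output):
# inspect the most frequent unigrams directly before plotting, e.g.
# head(OnegramData[, c("words", "frequency")], 10)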

library(ggplot2)

ggplot(OnegramData[1:20,], aes(reorder(words, -frequency), frequency)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title= "Top 20 Unigrams", x = "Unigrams", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 60, size = 12, hjust=1))

Figure: Top 20 Unigrams

The 1-gram accounts only for words made up of at least 3 characters; this avoids over-representing connectors such as “of”, “a”, etc. The top 20 words are shown in the figure above.

library(NLP)   # provides ngrams() and words(), used by the tokenizer below

# Tokenizer that turns each document into its sequence of 3-grams.
NLPtrigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "),
         use.names = FALSE)
}

TrigramTokenizer <- TermDocumentMatrix(corpus,
                                       control = list(tokenize = NLPtrigramTokenizer))
 

TrigramData <- Freq(TrigramTokenizer)
 
ggplot(TrigramData[1:20,], aes(reorder(words, -frequency), frequency)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title= "Top 20 Trigrams", x = "Trigrams", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 60, size = 12, hjust=1))

Figure: Top 20 Trigrams

Likewise, the 3-grams are built from the same cleaned corpus, from which stop words have already been removed. This has a significant impact on the resulting trigrams and on the conclusions drawn below.

Conclusion

Some preliminary conclusions can be drawn from this exploratory analysis:

  1. The 3-grams shown in this report have a high occurrence of phrases related to festivities and other general discussion topics. Thus, the 3-gram analysis provides an interesting insight into people's interests within a given community, country or period.

  2. The N-gram is an old and simple approach to language modeling that yields a more easily interpretable model than advanced models such as BERT and GPT-2 (models that power most current natural language processing applications).

  3. It is therefore the author's intention to continue developing the N-gram models for the purposes of this course.

  4. For a language predictor, all words should be accounted for and longer N-grams should be considered, as sketched below.
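
A minimal sketch of how a longer N-gram could be built with the same tokenizer approach used above (not run for this report; the wordLengths setting and the absence of stop-word removal are assumptions aimed at keeping all words):

# Sketch only: a 4-gram tokenizer analogous to NLPtrigramTokenizer above.
NLPquadgramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 4), paste, collapse = " "),
         use.names = FALSE)
}

# For a predictor, the corpus would be cleaned without removing stop words, and
# all word lengths would be kept, e.g.:
# QuadgramTokenizer <- TermDocumentMatrix(corpus,
#                          control = list(tokenize = NLPquadgramTokenizer,
#                                         wordLengths = c(1, Inf)))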