Exploratory Analysis NLP

1. Data source

The data comes from a corpus called HC Corpora (www.corpora.heliohost.org). It was downloaded from the "Data" link.

2. Objective

The goals of this project are:

  1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
  2. Profanity filtering - removing profanity and other words you do not want to predict.
  3. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
  4. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider:

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset?
  3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  4. How do you evaluate how many of the words come from foreign languages?
  5. Can you think of a way to increase the coverage - identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

3. Required Packages

library(stringi)
library(dplyr)
library(tm)
library(wordcloud)
library(ggplot2)
library(gridExtra)
library(RWeka)

4. Loading Data

The data is first downloaded and extracted. Then the English-language (en_US) files are loaded for the analysis.

twitter <- readLines("en_US.twitter.txt")
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")

5. Data Overview

To get a sense of the data set, some basic features (file size in bytes, number of lines, number of words, and average word count per line) are summarized in a table.

details <- data.frame(Name = c("twitter", "blogs", "news"),
          Size_Bytes = c(file.info("en_US.twitter.txt")$size, file.info("en_US.blogs.txt")$size,
          file.info("en_US.news.txt")$size), 
          Length = c(length(twitter), length(blogs), length(news)),
          Word_count = c(sum(stri_count_words(twitter)),sum(stri_count_words(blogs)),
                sum(stri_count_words(news))))
details <- mutate(details, Words_per_line = Word_count/Length)
print(details)
##      Name Size_Bytes  Length Word_count Words_per_line
## 1 twitter  167105338 2360148   30218125       12.80349
## 2   blogs  210160014  899288   38154238       42.42716
## 3    news  205811889   77259    2693898       34.86840

The table shows that the files are too large to process in full, so the analysis is carried out on a random sample drawn from each data set.
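
One thing the table suggests checking: the news file is about as large as the blogs file (roughly 206 MB vs. 210 MB) yet yields far fewer lines and words, which may mean that readLines stopped before the end of the file. A possible cause, stated here as an assumption rather than something verified for this run, is a control character in the file that ends a text-mode read early on some platforms. A minimal sketch of reading through a binary connection, where read_lines_binary is a hypothetical helper name:

```r
# Hypothetical helper: read all lines of a text file via a binary connection,
# so that embedded control characters do not end the read early.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained check on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If the line count for news changes when read this way, the summary table and the sample would need to be regenerated.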

6. Data Sampling and Cleansing

We have three different data files from three sources. Considering limitations such as file size and processing speed, a random sample of only 1000 lines is drawn from each data set.

The full list of profanity words used for filtering can be downloaded and extracted from the "badwords" link.

tweet_sample <- sample(twitter, 1000)
blogs_sample <- sample(blogs, 1000)
news_sample <- sample(news, 1000)
sample_data <- c(tweet_sample,blogs_sample,news_sample)
rm(tweet_sample, blogs_sample, news_sample)

badwords <- readLines("full-list-of-bad-words-text-file_2018_03_26.txt")

text_data <- function(x) {
  text <- paste(x)
  text <- removePunctuation(text)
  ## Removing special (non-ASCII) characters
  text <- iconv(text, "UTF-8", "ASCII", sub = "")
  text <- removeNumbers(text)
  ## Converting to lower case
  text <- tolower(text)
  ## Removing single-letter words
  text <- gsub("\\b[a-z]\\b", " ", text)
  ## Removing profanity and leftover contraction fragments
  text <- removeWords(text, c(badwords, "s", "ve", "m"))
  text <- stripWhitespace(text)
  ## Returning the cleaned text explicitly
  text
}

text <- text_data(sample_data)

## Removing stopwords
text1 <- removeWords(text, stopwords("english"))
text1 <- stripWhitespace(text1)

7. Exploratory Analysis

7.1 Some words are more frequent than others - what are the distributions of word frequencies?

wordcloud(text, random.order = FALSE, max.words = 50, colors = rainbow(3))
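
A word cloud shows which words dominate, but not the shape of the distribution. As a minimal sketch, assuming the cleaned text is whitespace-separated, the counts can also be tabulated in base R. The toy_text vector below is made up for illustration and stands in for the cleaned sample:

```r
# Toy stand-in for the cleaned text; values are made up for illustration
toy_text <- c("the cat sat on the mat", "the dog sat")

# Split on whitespace and count each word, most frequent first
words <- unlist(strsplit(toy_text, "\\s+"))
word_freq <- sort(table(words), decreasing = TRUE)

# Share of all word instances taken by each word
word_share <- word_freq / sum(word_freq)
print(head(word_freq, 3))
```

Plotting word_freq against its rank (for example on log-log axes) would show how quickly the frequencies fall off after the first few words.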

7.2 What are the frequencies of 2-grams and 3-grams in the dataset?

Using the functions below, we generate unigrams, bigrams and trigrams from the cleaned data and plot the 20 most frequent of each.

## N-gram frequency tables (stopwords removed)

## Tokenize x into n-grams of size n and return a frequency table,
## most frequent first. The token column is named "x".
ngram_freq <- function(x, n){
  x <- NGramTokenizer(x, Weka_control(min = n, max = n))
  x <- data.frame(table(x))
  arrange(x, desc(Freq))
}

grams1 <- function(x) ngram_freq(x, 1)
grams2 <- function(x) ngram_freq(x, 2)
grams3 <- function(x) ngram_freq(x, 3)

## Tokenize the cleaned text directly. Re-tokenizing the output of an
## earlier NGramTokenizer call would count the same words several times.
unigrams <- grams1(text1)
bigrams <- grams2(text1)
trigrams <- grams3(text1)

## Horizontal bar chart of the 20 most frequent n-grams in a frequency table
plot_top20 <- function(grams, label){
  ggplot(head(grams, 20), aes(x = reorder(x, Freq), y = Freq)) +
    geom_bar(stat = 'identity', aes(fill = x)) +
    geom_text(aes(y = 1, label = Freq),
              hjust = 0, vjust = 0.5, size = 4, colour = 'black',
              fontface = 'bold') +
    ## guides(fill = FALSE) is deprecated in recent ggplot2 versions
    guides(fill = "none") +
    xlab(label) + ylab("Frequency") + ggtitle(paste("Top 20", label)) +
    coord_flip() +
    theme_bw()
}

## Unigram, bigram and trigram plots
p1s <- plot_top20(unigrams, "Uni-grams")
p2s <- plot_top20(bigrams, "Bi-grams")
p3s <- plot_top20(trigrams, "Tri-grams")

grid.arrange(p1s, p2s, p3s, ncol = 3, top = "When stopwords are excluded")

7.3 How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

unique_word_percent <- nrow(unigrams)/sum(unigrams$Freq)
unique_word_percent
## [1] 0.0518642
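## Number of top-ranked words needed to reach the given share of all word
## instances; x must be sorted by decreasing Freq
coverage <- function(x, percent_cover){
  ## Running total of word instances covered by the top-i words
  covered <- cumsum(x$Freq)
  ## Index of the first word at which the target coverage is reached
  which(covered >= percent_cover * sum(x$Freq))[1]
}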
 
coverage(unigrams, 0.5)
## [1] 968
coverage(unigrams, 0.9)
## [1] 8559

In this sample, 968 unique words are needed to cover 50% of all word instances and 8559 unique words to cover 90%.

The number of unique words required grows much faster than the coverage: moving from 50% to 90% coverage takes roughly nine times as many words (8559 vs. 968).
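
To make the calculation concrete, here is a small worked sketch on made-up frequencies (toy_freq and words_needed are hypothetical names, not part of the analysis above). It applies the same cumulative-sum idea to several coverage levels at once:

```r
# Made-up frequency table, already sorted from most to least frequent
toy_freq <- data.frame(x = c("the", "and", "cat", "mat"),
                       Freq = c(60, 25, 10, 5))

# Hypothetical helper: words needed to reach each coverage level
words_needed <- function(freq, levels) {
  covered <- cumsum(freq) / sum(freq)
  sapply(levels, function(p) which(covered >= p)[1])
}

needed <- words_needed(toy_freq$Freq, c(0.5, 0.9))
needed
```

Here the first word alone covers 60% of the instances, so one word suffices for 50%, while three words are needed to pass 90%.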

7.4 How do you evaluate how many of the words come from foreign languages?

One way to estimate this is to compare the words in the data against a well-known English dictionary and count those that are not found. Since most of the data considered is written in English, it was not considered necessary to carry out this analysis here.
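
Although it was not run on this data, the dictionary comparison could be sketched as follows. The function name non_dictionary_share and both word vectors are hypothetical. A real check would load a proper English word list, and words missing from it may also be typos, names or slang rather than foreign words:

```r
# Hypothetical helper: share of tokens not found in a reference word list
non_dictionary_share <- function(tokens, dictionary) {
  tokens <- tolower(tokens)
  mean(!(tokens %in% tolower(dictionary)))
}

# Made-up example data for illustration only
toy_tokens <- c("the", "house", "casa", "dog", "maison")
toy_dictionary <- c("the", "house", "dog", "cat")

share <- non_dictionary_share(toy_tokens, toy_dictionary)
share
```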

7.5 Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

The number of unique words needed grows steeply as coverage increases, because most unique words occur only rarely. One option is therefore to drop low-frequency words from the dictionary and substitute them with more frequent synonyms.
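
A minimal sketch of the substitution idea, assuming a synonym lookup is available. The replace_rare function, the min_count threshold and the toy synonym map are all hypothetical; building a real synonym map would need an external resource such as a thesaurus:

```r
# Hypothetical helper: replace words seen fewer than min_count times
# with a more frequent synonym, when one is listed in the lookup
replace_rare <- function(tokens, synonyms, min_count = 2) {
  counts <- table(tokens)
  rare <- names(counts)[counts < min_count]
  swap <- tokens %in% rare & tokens %in% names(synonyms)
  tokens[swap] <- synonyms[tokens[swap]]
  unname(tokens)
}

# Made-up example data for illustration only
toy_tokens <- c("big", "big", "large", "huge", "dog", "dog")
toy_synonyms <- c(large = "big", huge = "big")

reduced <- replace_rare(toy_tokens, toy_synonyms)
length(unique(toy_tokens))
length(unique(reduced))
```

In this toy case the dictionary shrinks from four unique words to two while the number of word instances stays the same.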