Capstone Project - EDA for NLP Task

Abstract

This document fulfills the first assignment of the Data Science Capstone (Natural Language Processing), which asks for an exploratory data analysis of the course text data. To do this, I mainly use the quanteda package, as recommended by the course instructor in the discussion forums.

Introduction

I’ll address the following issues:

1- Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?

2- Has the data scientist made basic plots, such as histograms to illustrate features of the data?

3- What are the distributions of word frequencies?

4- What are the frequencies of 2-grams and 3-grams in the dataset?

5- How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

6- How do you evaluate how many of the words come from foreign languages?

Let’s start by loading the basic dependencies:

library(quanteda)
library(ggplot2)
library(manipulate)

#This library helps in detecting the language of words and texts
library(cld2)

#set a seed so the random sample drawn later is reproducible
set.seed(510)
#working directory containing the three en_US text files
setwd("C:/Users/ttt/Desktop/final/en_US")
options(download.file.method = "libcurl")

Now that the required libraries are loaded, let's load the raw data:

#read each file into a corpus in the global environment and record the source of each corpus (blog, twitter or news)
corpblogs <- corpus(readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE))
docvars(corpblogs, "Source") <- "Blog"
corptwit <- corpus(readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE))
docvars(corptwit, "Source") <- "Twitter"
corpnews <- corpus(readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE))
docvars(corpnews, "Source") <- "News"

#combine all three corpora to analyze them at once
corpall <- corpblogs + corptwit + corpnews

This results in a very large corpus that is hard to process on an ordinary computer, so we will later draw a sample to work with. The data is also still in its raw form: no cleaning has taken place yet, and removing numbers, punctuation, stopwords and profanity should be kept in mind for later.
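To get a feel for just how large the combined corpus is before sampling, a quick check like the following can be run. This is only an illustrative sketch, not part of the analysis below; object.size() comes from base R and ndoc() from quanteda.

#illustrative check of the combined corpus' size before sampling
format(object.size(corpall), units = "Mb")   #approximate in-memory size
ndoc(corpall)                                #number of documents (i.e. lines)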

1- Basic summaries of the three files: word counts, line counts and basic data tables

Let us start with the line counts of the raw data.

#See how many lines this big combined corpus contains:
length(texts(corpall))
## [1] 3336695
#Number of lines of the Blogs file
length(texts(corpblogs))
## [1] 899288
#Number of lines of the News file
length(texts(corpnews))
## [1] 77259
#Number of lines of the Twitter file
length(texts(corptwit))
## [1] 2360148

When it comes to word counts, the computational load grows substantially, and an ordinary computer cannot comfortably count such a huge number of words. Therefore, we sample the data in order to work with a more practical dataset.

We start by sampling the data:

#gets a random sample of 1% of all the texts; this is a workable sample size for my device
corpallsampled <- corpus_sample(corpall, size=length(texts(corpall))*0.01)

#let's now see how many lines this sampled corpus contains:
length(texts(corpallsampled))
## [1] 33366

Now that we have the sampled dataset, let's count "words":

#gets the total number of words (tokens) in the sampled dataset
allsampledwords <- sum(ntoken(corpallsampled))

#gets the number of words in the sampled dataset whose source is Twitter
twittersampledwords <- sum(ntoken(corpallsampled[docvars(corpallsampled, "Source") == "Twitter"]))

#gets the number of words in the sampled dataset whose source is Blogs
Blogsampledwords <- sum(ntoken(corpallsampled[docvars(corpallsampled, "Source") == "Blog"]))

#gets the number of words in the sampled dataset whose source is News
Newssampledwords <- sum(ntoken(corpallsampled[docvars(corpallsampled, "Source") == "News"]))

#put them all together in a nice table
sampled_words_count <- data.frame(WordsNum_Twitter= twittersampledwords, WordsNum_Blogs = Blogsampledwords, WordsNum_News = Newssampledwords, Wordsnum_All = allsampledwords)
sampled_words_count
##   WordsNum_Twitter WordsNum_Blogs WordsNum_News Wordsnum_All
## 1           369269         421586         29670       820525

Since we sampled only 1%, we can easily approximate the counts for the whole dataset by multiplying by 100.

#approximation of number of words in the original large (i.e. non-sampled) dataset
all_words_counts <- sampled_words_count*100
all_words_counts
##   WordsNum_Twitter WordsNum_Blogs WordsNum_News Wordsnum_All
## 1         36926900       42158600       2967000     82052500

2- Basic plots and histograms illustrating features of the data

To explore the data further, we can draw a barplot showing the different sources of the texts in the whole dataset.

#returns a barplot showing the different sources of the dataset (i.e. the texts)
ggplot(as.data.frame(addmargins(table(docvars(corpall)$Source))), aes(x=reorder(Var1, -Freq), y=Freq/1000))+
      geom_bar(stat="identity")+
      labs(x="Source", y="Number of texts (in thousands)", title = "Summary of the different sources of the data")+theme_minimal()

As for feature extraction, we could extract all the features in the dataset, but for presentation purposes we only show the top features of our sampled dataset, as follows:

#we start by inspecting the first 20 texts to get an idea of the raw form of the data.
#We will at the same time clean the sample of any punctuation and stopwords
dfm_sort(dfm(head(corpallsampled, 20), remove = stopwords("english"), remove_punct=TRUE))[,1:10]
## Document-feature matrix of: 20 documents, 10 features (91.0% sparse).
## 20 x 10 sparse Matrix of class "dfm"
##               features
## docs           love make 4 just need like want goin documents containing
##   text1620044     1    0 0    1    0    0    0    0         0          0
##   text939689      0    0 0    0    1    1    0    0         0          0
##   text1257057     0    0 0    0    1    1    0    0         0          0
##   text2041483     0    1 0    0    0    0    1    0         0          0
##   text2983671     0    0 0    1    0    0    0    0         0          0
##   text1948542     0    1 0    0    0    0    0    2         0          0
##   text264202      0    0 0    0    0    0    0    0         2          2
##   text1285788     0    0 0    0    0    0    0    0         0          0
##   text11182510    0    0 0    0    0    0    0    0         0          0
##   text1912615     0    0 0    0    0    0    0    0         0          0
##   text1415588     1    0 0    0    0    0    0    0         0          0
##   text2271424     0    0 0    0    0    0    0    0         0          0
##   text897753      0    0 0    0    0    0    0    0         0          0
##   text12497100    0    1 0    0    0    0    0    0         0          0
##   text1851        1    0 0    0    0    0    1    0         0          0
##   text8423121     0    0 0    0    0    0    0    0         0          0
##   text2107264     0    0 0    0    0    0    0    0         0          0
##   text1023946     0    0 0    0    0    0    0    0         0          0
##   text31704       0    0 3    0    0    0    0    0         0          0
##   text724731      0    0 0    0    0    0    0    0         0          0
#We now return the most frequent tokens (or features) in the sampled dataset
sampletopfeatures <- topfeatures(dfm(corpallsampled, remove = stopwords("english"), remove_punct=TRUE), 20)
sampletopfeatures
##   just   like    one    can    get   time   love   good    day    now 
##   2628   2276   2006   1965   1832   1651   1527   1515   1470   1468 
##   know    new people    see     go  great   back  think   make  going 
##   1421   1265   1178   1175   1161   1126   1119   1104   1019    999
#Let's now see the 10 most frequent words grouped by the texts' source
addmargins(
      dfm_sort(dfm(corpallsampled, groups = "Source", remove = stopwords("english"), remove_punct = TRUE))[,1:10]
      #the subsetting below drops the per-document total column added by addmargins(), which might otherwise confuse things
      )[,-11]
##          features
## docs      just like  one  can  get time love good  day  now
##   Blog    1044 1037 1166 1016  645  889  442  490  520  578
##   News      34   34   60   31   25   20   11   17   19   26
##   Twitter 1550 1205  780  918 1162  742 1074 1008  931  864
##   Sum     2628 2276 2006 1965 1832 1651 1527 1515 1470 1468
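As a side note, a ranked frequency table per source (the top 10 words within each source) can also be produced with textstat_frequency(). The sketch below is only an alternative for comparison and assumes the older quanteda API used throughout this report (in recent quanteda versions the function lives in the quanteda.textstats package):

#alternative sketch: top 10 features within each source via textstat_frequency()
dfmsampled <- dfm(corpallsampled, remove = stopwords("english"), remove_punct = TRUE)
textstat_frequency(dfmsampled, n = 10, groups = docvars(dfmsampled, "Source"))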

We can present the most prominent features (i.e. the most frequent words) more graphically using word clouds, as follows:

#Let's make a word cloud for the sampled dataset
textplot_wordcloud(dfm(corpallsampled, remove = stopwords("english"), remove_punct=TRUE),
                   min_count = 1000, random_order = FALSE, rotation = .25,
                   color = RColorBrewer::brewer.pal(8, "Dark2"))

#Let's make a word cloud for the sampled dataset with a comparison between the different sources
textplot_wordcloud(dfm(corpallsampled, groups = "Source", remove = stopwords("english"), remove_punct=TRUE),
                   min_count = 1000, random_order = FALSE, comparison = TRUE,
                   color = RColorBrewer::brewer.pal(3, "Dark2"))

3- Distributions of word frequencies

A practical way to exhibit word frequencies is to take the sampled dataset, tokenize it, clean it, and plot a barplot of its most frequent words. We could show the frequencies of all words, but that is not very practical since there are thousands of them.

Here we do so, but we barplot the top 20 most frequent words only.

We start by preparing for the n-gram exploration:

#Let's load the profanity list so we can remove those words later (this list was referred to through this source: https://stackoverflow.com/questions/3531746/what-s-a-good-python-profanity-filter-library)
profanity <- readLines("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")

Now we proceed with tokenization, cleaning, and plotting.

#We now clean the dataset of punctuation, numbers, profanity and stopwords, as we only need "meaningful" words
toks0 <- tokens(corpallsampled, remove_punct = TRUE, remove_numbers = TRUE)
toks1 <- tokens_remove(toks0, pattern = c(stopwords('en'), profanity))

#make a barplot for the frequencies of the top 20 words (i.e. unigrams) in the sample data 
rn <- names(sampletopfeatures)
ggplot(as.data.frame(sampletopfeatures),
       aes(x=reorder(rn, -sampletopfeatures), y=sampletopfeatures))+
      geom_bar(stat="identity")+ scale_y_continuous(limits = c(0,3000))+
      labs(x="Words", y="Frequency", title = "Top 20 Features (i.e. most frequent words) in text sample") +theme_minimal()

4- Frequencies of 2-grams and 3-grams in the dataset

We replicate the same method for bigrams and trigrams (and, for good measure, 4-grams as well).

#make barplot for the frequencies of the top 20 bigrams in the sample data 

toks_ngram2 <- tokens_ngrams(toks1, n = 2, concatenator = " ")
TF_toks_ngram2 <- topfeatures(dfm(toks_ngram2), 20)
TF_toks_ngram2_names <- names(TF_toks_ngram2)

ggplot(as.data.frame(TF_toks_ngram2),
       aes(x=reorder(TF_toks_ngram2_names, -TF_toks_ngram2), y=TF_toks_ngram2))+
      geom_bar(stat="identity")+ 
      labs(x="Words", y="Frequency", title = "Top 20 Features of 2-grams in text sample")+
      theme_minimal()+theme(axis.text.x = element_text(angle = 90, hjust = 1))

#____________________________________________

#make barplot for the frequencies of the top 20 trigrams in the sample data 
toks_ngram3 <- tokens_ngrams(toks1, n = 3, concatenator = " ")
TF_toks_ngram3 <- topfeatures(dfm(toks_ngram3), 20)
TF_toks_ngram3_names <- names(TF_toks_ngram3)

ggplot(as.data.frame(TF_toks_ngram3),
       aes(x=reorder(TF_toks_ngram3_names, -TF_toks_ngram3), y=TF_toks_ngram3))+
      geom_bar(stat="identity")+ 
      labs(x="Words", y="Frequency", title = "Top 20 Features of 3-grams in text sample")+
      theme_minimal()+theme(axis.text.x = element_text(angle = 90, hjust = 1))

#____________________________________________

#make barplot for the frequencies of the top 20 four-grams in the sample data 
toks_ngram4 <- tokens_ngrams(toks1, n = 4, concatenator = " ")
TF_toks_ngram4 <- topfeatures(dfm(toks_ngram4), 20)
TF_toks_ngram4_names <- names(TF_toks_ngram4)

ggplot(as.data.frame(TF_toks_ngram4),
       aes(x=reorder(TF_toks_ngram4_names, -TF_toks_ngram4), y=TF_toks_ngram4))+
      geom_bar(stat="identity")+ 
      labs(x="Words", y="Frequency", title = "Top 20 Features of 4-grams in text sample")+
      theme_minimal()+theme(axis.text.x = element_text(angle = 90, hjust = 1))

5- Unique words in a frequency-sorted dictionary needed to cover 50% & 90% of all word instances in the language

This task requires solving for an unknown \(x\): the number of top features, as sorted in a frequency dictionary, needed to reach a given coverage. In terms of the code used in this report, the coverage achieved by the top \(x\) words is sum(topfeatures(dfm(corpallsampled, remove_punct=TRUE), x)) / sum(dfm(corpallsampled, remove_punct=TRUE)), or, in a more conceptually accessible form:

\(\frac{\text{sum of the frequencies of the top } x \text{ features}}{\text{total number of word instances in the dataset}} = \text{Coverage percentage}\)

Since all the other quantities are known, an algorithm could optimize for \(x\) in order to obtain the number of words needed to cover a given percentage.

That turned out to be too computationally expensive, however, so I preferred a trial-and-error approach using the above formula, continuously adjusting \(x\). I used the manipulate package to do this interactively with the following code (commented out here, for illustration only):

# manipulate(
#       plot(
#             sum(topfeatures(dfm(corpallsampled, remove_punct=TRUE), x))/sum(dfm(corpallsampled, remove_punct=TRUE))
#       ),
#       x = slider(100, 7000, step =15)
# )

In the end, I found that covering 90% of all word instances in the sampled dataset requires the top (i.e. most frequent) 6898 words, while covering 50% requires only the top 129 words.
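For completeness, the same numbers can also be computed directly, without trial and error, from the cumulative sums of the sorted word frequencies. This is only an illustrative sketch, assuming the same dfm call as in the formula above:

#illustrative sketch: compute coverage directly via cumulative frequency sums
wordfreqs <- sort(colSums(dfm(corpallsampled, remove_punct = TRUE)), decreasing = TRUE)
coverage <- cumsum(wordfreqs) / sum(wordfreqs)
min(which(coverage >= 0.5))   #number of top words needed to cover 50% of word instances
min(which(coverage >= 0.9))   #number of top words needed to cover 90% of word instances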

6- Evaluating how many words come from foreign languages

To do this task, we use the cld2 package.

To get the ratio of English to foreign-language content in the dataset, we detect the language of each unit (a phrase or a word), count how many units are assigned to each language, and divide the English count by the total count to obtain the proportion of English.

#returns the proportion of phrases detected as English (i.e. taking each phrase as the unit of language identification and analyzing it on its own)
table(detect_language(texts(corpallsampled)))["en"] / sum(table(detect_language(texts(corpallsampled))))
##        en 
## 0.9974196
#returns the proportion of words detected as English (i.e. analyzing each word on its own as the unit of language identification)
table(detect_language(as.character(toks0)))["en"] / sum(table(detect_language(as.character(toks0))))
##        en 
## 0.8382347
#this is after accounting for profanity and stopwords
table(detect_language(as.character(toks1)))["en"] / sum(table(detect_language(as.character(toks1))))
##        en 
## 0.8333658
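
As a quick extra check (not part of the analysis above), one can also inspect which other languages cld2 detects in the sample. A minimal sketch, reusing the same phrase-level detection:

#illustrative sketch: full table of detected languages at the phrase level, most frequent first
sort(table(detect_language(texts(corpallsampled))), decreasing = TRUE)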