This document is submitted in fulfillment of the first assignment of the Data Science Capstone (Natural Language Processing), which asks for an exploratory data analysis of the corpus. To do this, I mainly use the quanteda package, as recommended by the course instructor in the discussion forums.
I’ll address the following issues:
1- Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
2- Has the data scientist made basic plots, such as histograms to illustrate features of the data?
3- What are the distributions of word frequencies?
4- What are the frequencies of 2-grams and 3-grams in the dataset?
5- How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
6- How do you evaluate how many of the words come from foreign languages?
Let’s start by loading the basic dependencies:
library(quanteda)
library(ggplot2)
library(manipulate)
#cld2 helps detect the language of words and phrases
library(cld2)
set.seed(510)
setwd("C:/Users/ttt/Desktop/final/en_US")
options(download.file.method = "libcurl")
Now that the needed libraries are loaded, let's load the raw data.
#read each file into a corpus in the Global environment and tag each with its source (blog, Twitter or news)
corpblogs <- corpus(readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE))
docvars(corpblogs, "Source") <- "Blog"
corptwit <- corpus(readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE))
docvars(corptwit, "Source") <- "Twitter"
corpnews <- corpus(readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE))
docvars(corpnews, "Source") <- "News"
#combine all the corpora so they can be analyzed at once
corpall <- corpblogs + corptwit + corpnews
This results in a huge object that is hard for an ordinary computer to process, so we will later draw a sample to work with. The data is also still in its raw form; no cleaning has taken place yet. It will likely need numbers, punctuation, stopwords and profanity removed, which should be kept in mind.
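A quick way to see why sampling is needed is to check how much memory the combined corpus occupies. This is a minimal sketch using base R's object.size(); the exact figure will depend on the machine and the quanteda version.
#check the in-memory size of the combined corpus before deciding on a sample size
format(object.size(corpall), units = "MB")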
Let us start with line counts of the raw data.
#See how many lines this large combined corpus contains:
length(texts(corpall))
## [1] 3336695
#Number of lines of the Blogs file
length(texts(corpblogs))
## [1] 899288
#Number of lines of the News file
length(texts(corpnews))
## [1] 77259
#Number of lines of the Twitter file
length(texts(corptwit))
## [1] 2360148
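Since each line of the raw files was read in as one quanteda document, the same counts can be obtained slightly more directly with ndoc(); a small sketch (in recent quanteda releases texts() is deprecated in favour of as.character(), so ndoc() is also more future-proof):
#equivalent document (i.e. line) counts using ndoc()
ndoc(corpall)
ndoc(corpblogs)
ndoc(corpnews)
ndoc(corptwit)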
When it comes to word counts, the computational demands explode, and an ordinary computer cannot comfortably count the huge number of words in the full corpus. We therefore sample the data in order to work with a more practical dataset.
We start by sampling the data:
#draw a random sample of 1% of all the texts; this is a workable size on my device
corpallsampled <- corpus_sample(corpall, size=length(texts(corpall))*0.01)
#let's now see how many lines this reduced sample contains:
length(texts(corpallsampled))
## [1] 33366
Now that we have the sampled dataset, let’s count “words”
#gets the total number of words in the sampled dataset
allsampledwords <- sum(ntoken(corpallsampled))
#gets the number of words in the sampled dataset whose source is Twitter
twittersampledwords <- sum(ntoken(corpallsampled[docvars(corpallsampled, "Source") == "Twitter"]))
#gets the number of words in the sampled dataset whose source is Blogs
Blogsampledwords <- sum(ntoken(corpallsampled[docvars(corpallsampled, "Source") == "Blog"]))
#gets the number of words in the sampled dataset whose source is News
Newssampledwords <- sum(ntoken(corpallsampled[docvars(corpallsampled, "Source") == "News"]))
#put them all together in a nice table
sampled_words_count <- data.frame(WordsNum_Twitter= twittersampledwords, WordsNum_Blogs = Blogsampledwords, WordsNum_News = Newssampledwords, Wordsnum_All = allsampledwords)
sampled_words_count
## WordsNum_Twitter WordsNum_Blogs WordsNum_News Wordsnum_All
## 1 369269 421586 29670 820525
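For reference, the same per-source totals can also be obtained in a single pass by grouping the per-document token counts by the Source docvar; a minimal sketch:
#sum the token counts of the sampled documents, grouped by their Source docvar
tapply(ntoken(corpallsampled), docvars(corpallsampled, "Source"), sum)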
Since we sampled only 1%, we can easily approximate the counts for the whole dataset by multiplying by 100.
#approximation of number of words in the original large (i.e. non-sampled) dataset
all_words_counts <- sampled_words_count*100
all_words_counts
## WordsNum_Twitter WordsNum_Blogs WordsNum_News Wordsnum_All
## 1 36926900 42158600 2967000 82052500
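If memory and running time allow, the exact totals could instead be computed on the full, unsampled corpus; a sketch (this is slow on roughly 3.3 million lines):
#exact total word count over the full corpus (computationally heavy)
sum(ntoken(corpall))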
To explore the data further, we can draw a barplot showing the number of texts contributed by each source in the whole dataset (together with the overall total).
#returns a barplot showing the different sources of the dataset (i.e. the texts)
ggplot(as.data.frame(addmargins(table(docvars(corpall)$Source))), aes(x=reorder(Var1, -Freq), y=Freq/1000))+
geom_bar(stat="identity")+
labs(x="Source", y="Number of texts (in thousands)", title = "Summary of the different sources of the data")+theme_minimal()
As for feature extraction, we could extract all the features in the dataset, but for exhibition purposes we only show the top features of our sampled dataset, as follows.
#we start by inspecting the first 20 texts to get an idea of the raw form of the data.
#We will at the same time remove any punctuation and stopwords from the sample
dfm_sort(dfm(head(corpallsampled, 20), remove = stopwords("english"), remove_punct=TRUE))[,1:10]
## Document-feature matrix of: 20 documents, 10 features (91.0% sparse).
## 20 x 10 sparse Matrix of class "dfm"
## features
## docs love make 4 just need like want goin documents containing
## text1620044 1 0 0 1 0 0 0 0 0 0
## text939689 0 0 0 0 1 1 0 0 0 0
## text1257057 0 0 0 0 1 1 0 0 0 0
## text2041483 0 1 0 0 0 0 1 0 0 0
## text2983671 0 0 0 1 0 0 0 0 0 0
## text1948542 0 1 0 0 0 0 0 2 0 0
## text264202 0 0 0 0 0 0 0 0 2 2
## text1285788 0 0 0 0 0 0 0 0 0 0
## text11182510 0 0 0 0 0 0 0 0 0 0
## text1912615 0 0 0 0 0 0 0 0 0 0
## text1415588 1 0 0 0 0 0 0 0 0 0
## text2271424 0 0 0 0 0 0 0 0 0 0
## text897753 0 0 0 0 0 0 0 0 0 0
## text12497100 0 1 0 0 0 0 0 0 0 0
## text1851 1 0 0 0 0 0 1 0 0 0
## text8423121 0 0 0 0 0 0 0 0 0 0
## text2107264 0 0 0 0 0 0 0 0 0 0
## text1023946 0 0 0 0 0 0 0 0 0 0
## text31704 0 0 3 0 0 0 0 0 0 0
## text724731 0 0 0 0 0 0 0 0 0 0
#We now return the most frequent tokens (or features) in the sampled dataset
sampletopfeatures <- topfeatures(dfm(corpallsampled, remove = stopwords("english"), remove_punct=TRUE), 20)
sampletopfeatures
## just like one can get time love good day now
## 2628 2276 2006 1965 1832 1651 1527 1515 1470 1468
## know new people see go great back think make going
## 1421 1265 1178 1175 1161 1126 1119 1104 1019 999
#Let's now see the 10 most frequent words grouped by the texts' source
addmargins(
dfm_sort(dfm(corpallsampled, groups = "Source", remove = stopwords("english"), remove_punct=TRUE))[,1:10]
#drop the Sum column added by addmargins, which might otherwise confuse things
)[,-11]
## features
## docs just like one can get time love good day now
## Blog 1044 1037 1166 1016 645 889 442 490 520 578
## News 34 34 60 31 25 20 11 17 19 26
## Twitter 1550 1205 780 918 1162 742 1074 1008 931 864
## Sum 2628 2276 2006 1965 1832 1651 1527 1515 1470 1468
We can present the prominence of the features (i.e. the most frequent words) more graphically using word clouds, as follows.
#Let's make a word cloud for the sampled dataset
textplot_wordcloud(dfm(corpallsampled, remove = stopwords("english"), remove_punct=TRUE),
min_count = 1000, random_order = FALSE, rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
#Let's make a word cloud for the sampled dataset with a comparison between the different sources
textplot_wordcloud(dfm(corpallsampled, groups = "Source", remove = stopwords("english"), remove_punct=TRUE),
min_count = 1000, random_order = FALSE, comparison = TRUE,
color = RColorBrewer::brewer.pal(3, "Dark2"))
A practical way to exhibit word frequencies is to take the sampled dataset, tokenize it, clean it, and barplot its most frequent words. We could show the frequencies of all the words, but that is not very practical since there are thousands of them.
Here we do so, but we barplot the top 20 most frequent words only.
We start by preparing for the n-gram exploration.
#Load the profanity list so it can be filtered out later (the list was found via https://stackoverflow.com/questions/3531746/what-s-a-good-python-profanity-filter-library)
profanitycon <- url("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
profanity <- readLines(profanitycon)
close(profanitycon)
Now we proceed with tokenization, cleaning, and barplotting.
#We now clean the dataset of punctuation, numbers, profanity and stopwords, as we only need "meaningful" words
toks0 <- tokens(corpallsampled, remove_punct = TRUE, remove_numbers = TRUE)
toks1 <- tokens_remove(toks0, pattern = c(stopwords('en'), profanity))
#make a barplot for the frequencies of the top 20 words (i.e. unigrams) in the sample data
rn <- row.names.data.frame(as.data.frame(sampletopfeatures))
ggplot(as.data.frame(sampletopfeatures),
aes(x=reorder(rn, -sampletopfeatures), y=sampletopfeatures))+
geom_bar(stat="identity")+ scale_y_continuous(limits = c(0,3000))+
labs(x="Words", y="Frequency", title = "Top 20 Features(i.e. most frequent words) in text sample") +theme_minimal()
We replicate the same method for bigrams, trigrams, and four-grams.
#make barplot for the frequencies of the top 20 bigrams in the sample data
toks_ngram2 <- tokens_ngrams(toks1, n = 2, concatenator = " ")
TF_toks_ngram2<- topfeatures(dfm(toks_ngram2), 20)
TF_toks_ngram2_names <- row.names.data.frame(as.data.frame(TF_toks_ngram2))
ggplot(as.data.frame(TF_toks_ngram2),
aes(x=reorder(TF_toks_ngram2_names, -TF_toks_ngram2), y=TF_toks_ngram2))+
geom_bar(stat="identity")+
labs(x="Words", y="Frequency", title = "Top 20 Features of 2-grams in text sample")+
theme_minimal()+theme(axis.text.x = element_text(angle = 90, hjust = 1))
#____________________________________________
#make barplot for the frequencies of the top 20 trigrams in the sample data
toks_ngram3 <- tokens_ngrams(toks1, n = 3, concatenator = " ")
TF_toks_ngram3<- topfeatures(dfm(toks_ngram3), 20)
TF_toks_ngram3_names <- row.names.data.frame(as.data.frame(TF_toks_ngram3))
ggplot(as.data.frame(TF_toks_ngram3),
aes(x=reorder(TF_toks_ngram3_names, -TF_toks_ngram3), y=TF_toks_ngram3))+
geom_bar(stat="identity")+
labs(x="Words", y="Frequency", title = "Top 20 Features of 3-grams in text sample")+
theme_minimal()+theme(axis.text.x = element_text(angle = 90, hjust = 1))
#____________________________________________
#make barplot for the frequencies of the top 20 four-grams in the sample data
toks_ngram4 <- tokens_ngrams(toks1, n = 4, concatenator = " ")
TF_toks_ngram4<- topfeatures(dfm(toks_ngram4), 20)
TF_toks_ngram4_names <- row.names.data.frame(as.data.frame(TF_toks_ngram4))
ggplot(as.data.frame(TF_toks_ngram4),
aes(x=reorder(TF_toks_ngram4_names, -TF_toks_ngram4), y=TF_toks_ngram4))+
geom_bar(stat="identity")+
labs(x="Words", y="Frequency", title = "Top 20 Features of 4-grams in text sample")+
theme_minimal()+theme(axis.text.x = element_text(angle = 90, hjust = 1))
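Since the bigram, trigram, and four-gram blocks above differ only in the value of n, they could be collapsed into a small helper function. The sketch below is only a possible refactoring; the function name plot_top_ngrams is my own and not part of quanteda.
#helper that builds n-grams from cleaned tokens and barplots the top features (sketch)
plot_top_ngrams <- function(toks, n, top = 20) {
  ng <- tokens_ngrams(toks, n = n, concatenator = " ")
  tf <- topfeatures(dfm(ng), top)
  ggplot(data.frame(ngram = names(tf), freq = unname(tf)),
         aes(x = reorder(ngram, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    labs(x = "Words", y = "Frequency",
         title = paste0("Top ", top, " Features of ", n, "-grams in text sample")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
#e.g. plot_top_ngrams(toks1, 2), plot_top_ngrams(toks1, 3), plot_top_ngrams(toks1, 4)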
This task requires finding the value of an unknown \(x\): the number of top features, as ranked in a frequency-sorted dictionary, needed to reach a given coverage. In the code used here, the coverage achieved by the top \(x\) features is sum(topfeatures(dfm(corpallsampled, remove_punct=TRUE), x)) divided by sum(dfm(corpallsampled, remove_punct=TRUE)), or, in a more conceptually accessible form:
\(\frac{\text{Sum of the frequencies of the top } x \text{ features}}{\text{Total number of word tokens in the dataset}} = \text{Coverage Percentage}\)
Since all the other quantities are known, an algorithm could optimize for \(x\) to obtain the number of words needed to cover a given percentage.
But that turned out to be too computationally expensive, so I preferred a trial-and-error approach: applying the formula above and repeatedly adjusting \(x\). I used the manipulate package to do this interactively; the code is shown commented out below, for exhibition purposes only, since it only runs in an interactive session.
# manipulate(
# plot(
# sum(topfeatures(dfm(corpallsampled, remove_punct=TRUE), x))/sum(dfm(corpallsampled, remove_punct=TRUE))
# ),
# x = slider(100, 7000, step =15)
# )
In the end, I found that covering 90% of all word instances requires the top (i.e. most frequent) 6898 words in the sampled dataset, while covering 50% requires only the top 129 words.
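For reference, the same cut-offs can also be computed directly, without trial and error, by sorting the whole feature-frequency vector once and taking cumulative sums. This is a minimal sketch of that alternative (the object names word_freqs and coverage are my own):
#cumulative share of all word instances covered by the top-ranked words
word_freqs <- sort(colSums(dfm(corpallsampled, remove_punct = TRUE)), decreasing = TRUE)
coverage <- cumsum(word_freqs) / sum(word_freqs)
#smallest number of words reaching 50% and 90% coverage
which(coverage >= 0.5)[1]
which(coverage >= 0.9)[1]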
To evaluate how many words come from foreign languages, we use the cld2 package.
To get the ratio of English to foreign-language content in the dataset, we detect the language of each unit of text, count how many units fall under each language, and then divide the English count by the total to obtain the proportion of English content.
#returns the proportion of texts detected as English (taking each phrase/text as the unit of language identification)
table(detect_language(texts(corpallsampled)))["en"] / sum(table(detect_language(texts(corpallsampled))))
## en
## 0.9974196
#returns the proportion of words detected as English (taking each individual word as the unit of language identification)
table(detect_language(as.character(toks0)))["en"] / sum(table(detect_language(as.character(toks0))))
## en
## 0.8382347
#the same word-level proportion, after removing stopwords and profanity
table(detect_language(as.character(toks1)))["en"] / sum(table(detect_language(as.character(toks1))))
## en
## 0.8333658
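To get a sense of which foreign languages account for the remaining tokens, we can also tabulate the non-English detections. This is a minimal sketch; note that cld2 is unreliable on isolated words, so the word-level figures are only indicative.
#most common languages detected at the word level (NA means cld2 could not decide)
head(sort(table(detect_language(as.character(toks1)), useNA = "ifany"), decreasing = TRUE))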