Context

The objective of this document is to summarize the exploratory analysis made on the corpus dataset, which is one of the first steps in building the Shiny application to predict words in a sentence. Finally, the next steps in building the Shiny app will be presented, answering the questions posed in the modeling task. Thus, this report won’t contain much of the code used in the exploratory analysis; it is available through the source code link in the final notes.

Preprocessing step

The code link is supplied because the preprocessing step is too large to include in this report. In summary, the preprocessing consists of sampling each corpus, tokenizing and stemming the text, and building the bag-of-words (document-feature) matrices; the details are in the linked source code.
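As an illustration of the sampling step only, here is a minimal sketch, not the exact code used: the file paths assume the usual en_US layout of the raw dataset, the helper name is illustrative, and the 2000-line sample size is taken from the table in the next section.

# Sketch: read each raw file and keep a random sample of 2000 lines per source.
set.seed(1234)
sample_corpus <- function(path, n_lines = 2000) {
      all_lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(all_lines, size = n_lines)
}

blog_sample <- sample_corpus("./final/en_US/en_US.blogs.txt")
news_sample <- sample_corpus("./final/en_US/en_US.news.txt")
twit_sample <- sample_corpus("./final/en_US/en_US.twitter.txt")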

Information about the files

The following table shows a summary of the files:

library(knitr)
kable(words_stats)   # summary of the three source files
| File    | Size in Megabytes | Number of lines | Sample size (MB) | Number of sample lines |
|---------|-------------------|-----------------|------------------|------------------------|
| blog    | 248.49350         | 899288          | 0.5576248        | 2000                   |
| news    | 19.17972          | 77259           | 0.4840775        | 2000                   |
| twitter | 301.39694         | 2360148         | 0.2588806        | 2000                   |

Word frequencies

First, we load the bag of words for a unigram model (N-gram with N = 1) and the tokenized words from the news, twitter and blog data.

load("./Milestone report/dfm_matrix.RData")
load("./Milestone report/tokens.RData")

Let’s first look at the distribution of the number of words per source.

load("./Milestone report/nchar_all.RData")
ggplot(nchar_all, aes(x = number_words, fill = var))+ 
      geom_histogram(alpha = 0.5, bins = 100) +
      labs(x = "Number of characters", y = "Count", title = "Number of character per document") 

As shown, the blog data is the source whose documents contain the most words. This is a bit surprising: the same number of documents was sampled from each source in the preprocessing code, and the news source was the one expected to have the most words. As expected, the twitter source has the fewest words per document, given the character limit imposed by the social network.

Word clouds are useful for looking at each dataset independently.

library(wordcloud)
library(RColorBrewer)   # for brewer.pal()
# word cloud of the blog tokens
wordcloud(unlist(blog_tokens), scale = c(3, .05), max.words = 150,
          rot.per = .5, random.order = FALSE, use.r.layout = FALSE,
          colors = brewer.pal(8, "Dark2"))

Some of the most frequent words are time-related, like week, year and day; this could be related to the nature of the blogs from which the words were taken.

# word cloud of the news tokens
wordcloud(unlist(news_tokens), scale = c(3, .05), max.words = 150,
          rot.per = .5, random.order = FALSE, use.r.layout = FALSE,
          colors = brewer.pal(8, "Dark2"))

In the news word cloud, “said” has a high frequency, which suggests that reporters mostly quote what other people said in relation to the news they are covering.

# word cloud of the twitter tokens
wordcloud(unlist(twit_tokens), scale = c(2.5, .05), max.words = 150,
          rot.per = .5, random.order = FALSE, use.r.layout = FALSE,
          colors = brewer.pal(8, "Dark2"))

The following table presents the 15 most frequent words from each dataset and from the merged dataset:

| Blog words | Blog freq | News words | News freq | Twitter words | Twitter freq | Merge words | Merge freq |
|------------|-----------|------------|-----------|---------------|--------------|-------------|------------|
| one        | 299       | said       | 506       | thank         | 127          | said        | 602        |
| like       | 288       | year       | 251       | like          | 124          | one         | 543        |
| get        | 241       | one        | 170       | get           | 115          | like        | 523        |
| just       | 240       | time       | 145       | just          | 114          | get         | 478        |
| time       | 234       | can        | 136       | go            | 112          | year        | 464        |
| can        | 215       | new        | 131       | day           | 106          | time        | 462        |
| make       | 206       | state      | 124       | love          | 106          | just        | 461        |
| go         | 205       | go         | 124       | good          | 86           | go          | 441        |
| year       | 175       | two        | 123       | time          | 83           | can         | 429        |
| love       | 166       | get        | 122       | follow        | 80           | make        | 377        |
| work       | 164       | say        | 122       | know          | 78           | day         | 357        |
| day        | 163       | first      | 118       | can           | 78           | work        | 331        |
| know       | 160       | like       | 111       | great         | 76           | new         | 316        |
| thing      | 159       | citi       | 109       | rt            | 74           | good        | 300        |
| think      | 159       | just       | 107       | one           | 74           | love        | 294        |
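These counts should be reproducible directly from the tokenized samples. Here is a minimal base-R sketch, assuming blog_tokens, news_tokens and twit_tokens are the token lists loaded earlier; the helper name is illustrative.

# Sketch: count unigram frequencies and keep the 15 most frequent words.
top_words <- function(tok, n = 15) {
      freq <- sort(table(unlist(tok)), decreasing = TRUE)
      head(freq, n)
}

top_words(blog_tokens)
top_words(news_tokens)
top_words(twit_tokens)
top_words(c(blog_tokens, news_tokens, twit_tokens))   # merged dataset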

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

The sample corpus contains 14016 unique words. To cover 50% of all word instances only 604 words are needed, which is about 4.31 percent of all unique words; to cover 90%, 5694 words are needed, about 40.63 percent of the total.
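These figures can be computed from a frequency-sorted word count; a minimal sketch, assuming the merged token lists loaded earlier (object names are illustrative):

# Sketch: cumulative coverage of word instances by a frequency-sorted vocabulary.
word_freq <- sort(table(unlist(c(blog_tokens, news_tokens, twit_tokens))),
                  decreasing = TRUE)
coverage  <- cumsum(word_freq) / sum(word_freq)

words_for_50 <- which(coverage >= 0.5)[1]   # unique words needed for 50% coverage
words_for_90 <- which(coverage >= 0.9)[1]   # unique words needed for 90% coverage

words_for_50 / length(word_freq) * 100      # share of the vocabulary, in percent
words_for_90 / length(word_freq) * 100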

Bi-gram and Tri-gram models

For this task, we will work only with the full merged dataset, built from the sample sets of each corpus; a sketch of how the n-gram counts can be obtained is shown below.
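This is not the exact code used, just a hedged base-R sketch; the helper count_ngrams and the merged_tokens object are illustrative names, and for simplicity the sketch ignores document boundaries when joining consecutive tokens.

# Sketch: count the most frequent n-grams from the merged token stream.
count_ngrams <- function(tok, n = 2, top = 15) {
      words <- unlist(tok)
      grams <- vapply(seq_len(length(words) - n + 1), function(i) {
            paste(words[i:(i + n - 1)], collapse = " ")
      }, character(1))
      head(sort(table(grams), decreasing = TRUE), top)
}

merged_tokens <- c(blog_tokens, news_tokens, twit_tokens)
count_ngrams(merged_tokens, n = 2)   # bi-grams
count_ngrams(merged_tokens, n = 3)   # tri-grams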

Bi-gram

| words       | freq |
|-------------|------|
| year old    | 50   |
| last year   | 43   |
| new york    | 36   |
| right now   | 33   |
| high school | 32   |
| last week   | 32   |
| year ago    | 31   |
| make sure   | 27   |
| even though | 26   |
| look like   | 26   |
| feel like   | 22   |
| two year    | 22   |
| first time  | 21   |
| everi day   | 20   |
| unit state  | 20   |

An interesting finding about the most used bi-grams, “year old” and “last year”, is that time-reference words are again heavily used.

Tri-gram

| words                | freq |
|----------------------|------|
| omg omg omg          | 14   |
| new york citi        | 5    |
| coupl year ago       | 4    |
| presid barack obama  | 4    |
| san diego state      | 4    |
| mum mum mum          | 4    |
| protect inform bill  | 3    |
| two year ago         | 3    |
| question whether can | 3    |
| 12th grade foothil   | 3    |
| grade foothil high   | 3    |
| foothil high school  | 3    |
| pleas let know       | 3    |
| fanni mae freddi     | 3    |
| mae freddi mac       | 3    |

One point of interest is that the most used trigram is an abbreviation of “oh my god”, followed by “new york city”.

Further points of interest

How do you evaluate how many of the words come from foreign languages?
A/ By crossing the bag of words with an English dictionary. It is not the most accurate technique, but it gives a rough idea of how many words could come from a foreign language.
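A minimal sketch of that idea; the word-list path is an assumption (any plain-text English word list with one word per line would do).

# Sketch: estimate the share of words that are not in an English word list.
# The path below is hypothetical; use any one-word-per-line English dictionary.
# Note: stemmed tokens (e.g. "citi") will also be flagged, so this overestimates.
english_words <- readLines("./data/english_wordlist.txt", encoding = "UTF-8")

vocab <- unique(unlist(c(blog_tokens, news_tokens, twit_tokens)))
not_in_dictionary <- setdiff(vocab, english_words)

length(not_in_dictionary) / length(vocab)   # rough share of possibly foreign words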

Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

A/ There are two options, and they can be used simultaneously:
  1. Add new words to the dictionary.
  2. Add a placeholder word such as “Not_on_dictionary” (or whatever you want to call it), so that instead of producing NULLs or NaNs when crossing the corpora with the dictionary, every missing word is replaced with “Not_on_dictionary”, which increases the coverage; see the sketch below.
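A minimal sketch of option 2, reusing the hypothetical english_words vector from the previous sketch; the helper name is illustrative.

# Sketch: replace every out-of-vocabulary token with a single placeholder,
# so that dictionary lookups never return NULLs or NaNs.
replace_oov <- function(tok, dictionary, placeholder = "Not_on_dictionary") {
      ifelse(tok %in% dictionary, tok, placeholder)
}

merged_words <- unlist(c(blog_tokens, news_tokens, twit_tokens))
merged_words <- replace_oov(merged_words, english_words)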

Findings

  1. It’s important to take a small subset of the corpora, because the full corpus is too heavy to load on a single PC; the alternative is to use parallel processing across several machines.
  2. As the n in the n-gram model increases, it becomes easier to understand the context of the documents, or at least to get a rough idea of it, unlike the unigram model, which is mostly used for tasks like sentiment analysis.