Introduction
The goal of this milestone project is to demonstrate familiarity with the data and show that the work is on track toward building the prediction algorithm. The contents of this report are:
Downloading the data from the link provided and loading it successfully.
Exploratory analysis: data cleaning and preprocessing, summary statistics, and descriptive plots for each file (blogs, news, twitter) and for the combined sample.
Findings about the data, such as the most frequent n-grams (one, two, and three words) and the number of unique words needed to cover 50% and 90% of the corpus.
Next steps to create a prediction algorithm.
Download files from link
Setup working directory
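The download and setup code is not echoed in the report; below is a minimal sketch of that step, assuming the Coursera SwiftKey archive URL from the course assignment and a ./data folder inside the working directory (adjust both to your own setup).
# packages used throughout this report (assumed setup chunk)
library(stringi); library(quanteda); library(textclean); library(tm)
library(dplyr); library(knitr); library(kableExtra)
library(ggplot2); library(gridExtra); library(RColorBrewer)
# download and unpack the corpus into ./data (URL assumed from the course assignment)
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("./data")) dir.create("./data")
zip_file <- "./data/Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
  download.file(data_url, destfile = zip_file, mode = "wb")
  unzip(zip_file, exdir = "./data")   # creates ./data/final/en_US/...
}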
Reading Text Files
path = "./data"
blogs <- readLines(paste(path, "/final/en_US/en_US.blogs.txt", sep = ""), encoding="UTF-8", skipNul = TRUE, warn = FALSE)
news <- readLines(paste(path, "/final/en_US/en_US.news.txt", sep = ""), encoding="UTF-8", skipNul = TRUE, warn = FALSE)
twitter <- readLines(paste(path, "/final/en_US/en_US.twitter.txt", sep = ""), encoding="UTF-8", skipNul = TRUE, warn = FALSE)
Exploratory Analysis
Summary of descriptive statistics for each collection (blogs, news and tweets)
data <- list(blogs, news, twitter)
line_count <- sapply(data, stri_stats_general)[c('Lines'), ]
word_count <- sapply(data, stri_stats_latex)[c('Words'), ]
word_min_mean_max <- sapply(data, function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
data_summ <- data.frame(rbind(line_count, word_count, word_min_mean_max))
colnames(data_summ) <- c("Blogs", "News", "Twitter")
rownames(data_summ) <- c("Line Count", "Word Count", "Min. words per line", "Average words per line", "Max. words per line")
data_summ %>%
kable(digits = 1, format.args = list(big.mark = ",",
scientific = FALSE),
align = "c") %>%
kable_styling(position = "center")
|  | Blogs | News | Twitter |
|---|---|---|---|
| Line Count | 899,288.0 | 1,010,242.0 | 2,360,148.0 |
| Word Count | 37,570,839.0 | 34,494,539.0 | 30,451,170.0 |
| Min. words per line | 0.0 | 1.0 | 1.0 |
| Average words per line | 41.8 | 34.4 | 12.8 |
| Max. words per line | 6,726.0 | 1,796.0 | 47.0 |
Cleaning the data
The quanteda package is used to convert all text to lower case, replace special characters between words (hyphens, apostrophes, slashes, etc.) with spaces, and remove punctuation (periods, question marks, exclamation points). The profanity (“bad words”) list is downloaded from this link. A separate dataset with “stopwords” removed is created to compare its word frequencies against those of the original dataset with stopwords kept.
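A minimal sketch of loading the profanity list, assuming a plain-text file with one term per line (the URL is a placeholder for the link above):
bad_words_url <- "https://example.com/bad-words.txt"   # placeholder for the profanity list linked above
bad_words <- readLines(bad_words_url, warn = FALSE, skipNul = TRUE)
bad_words <- tolower(trimws(bad_words))                # normalize for removeWords()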
Blogs
Sample a fraction (10%) of the complete file to speed up the processing time.
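A minimal sketch of the sampling step (the seed is an arbitrary value used here only for reproducibility):
set.seed(1234)                                              # assumed seed
blog_sample <- sample(blogs, round(length(blogs) * 0.10))   # keep ~10% of the lines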
Cleaning blog_sample with the textclean and tm libraries.
blog_sample <- replace_contraction(blog_sample)            # expand contractions (e.g. "don't" -> "do not")
blog_sample <- replace_hash(blog_sample)                   # strip hashtags
blog_sample <- replace_html(blog_sample)                   # strip HTML markup and entities
blog_sample <- replace_url(blog_sample)                    # remove URLs
blog_sample <- tolower(blog_sample)                        # convert to lower case
blog_sample <- removePunctuation(blog_sample, ucp = TRUE)  # remove punctuation (tm)
blog_sample <- removeWords(blog_sample, bad_words)         # remove profanity (tm)
blog_sample <- replace_white(blog_sample)                  # collapse extra whitespace
Create blog_dfm (document-feature matrix) and list the top 20 words by frequency using the corpus_dfm and corpus_dfm_nsw functions (see Appendix I). The frequency tables illustrate word frequency with and without stopwords. Stopwords are common words, such as ‘a’, ‘the’, ‘and’, etc., that generally are not indexed or searchable by a search engine (source: Collins Dictionary).
blog_dfm <- corpus_dfm(blog_sample, 1)
blog_dfm_20 <- textstat_frequency(blog_dfm, 20)
blog_dfm_nsw <- corpus_dfm_nsw(blog_sample, 1)
blog_dfm_nsw_20 <- textstat_frequency(blog_dfm_nsw, 20)
| Word | Frequency |
|---|---|
| the | 186,317 |
| and | 109,420 |
| to | 107,498 |
| a | 90,249 |
| of | 88,114 |
| i | 84,754 |
| in | 59,741 |
| is | 51,224 |
| that | 47,354 |
| it | 44,206 |
| for | 36,594 |
| you | 30,574 |
| with | 28,771 |
| was | 28,464 |
| my | 27,482 |
| on | 27,445 |
| this | 26,021 |
| not | 26,018 |
| have | 24,765 |
| as | 22,361 |
| Word | Frequency |
|---|---|
| one | 12,508 |
| can | 10,863 |
| just | 10,069 |
| like | 9,639 |
| time | 9,008 |
| get | 7,140 |
| now | 6,122 |
| people | 6,092 |
| know | 5,874 |
| us | 5,518 |
| also | 5,488 |
| new | 5,487 |
| even | 5,237 |
| day | 5,073 |
| see | 5,059 |
| first | 5,048 |
| really | 5,047 |
| back | 5,007 |
| make | 4,959 |
| well | 4,953 |
Plot top 20 words for Blog sample with and without stopwords (see Appendix II for function code)
g1 <- top_20_barchart(blog_dfm_20, "Top 20 (Blog with stopwords)")
g2 <- top_20_barchart(blog_dfm_nsw_20, "Top 20 (Blog without stopwords)")
grid.arrange(g1, g2, nrow = 1)
Cumulative frequency plot to determine the total word count coverage at 50% and 90% of the total dictionary (see Appendix III). In this case, the Blog corpus with and without stopwords is used.
blog_freq_all <- textstat_frequency(blog_dfm)
blog_freq_all <- blog_freq_all %>%
mutate(Freqprc = (frequency / sum(frequency)) * 100,
Freqcum = cumsum(Freqprc))
blog_freq <- textstat_frequency(blog_dfm_nsw)
blog_freq <- blog_freq %>%
mutate(Freqprc = (frequency / sum(frequency)) * 100,
Freqcum = cumsum(Freqprc))
c1 <- cum_freq_plot(blog_freq_all, "Cumulative Frequency Plot (Blog)", "(with stopwords)")
c2 <- cum_freq_plot(blog_freq, "Cumulative Frequency Plot (Blog)", "(without stopwords)")
grid.arrange(c1, c2, nrow = 1)
From the cumulative frequency plot, the number of words that comprise 50% of the total word count is 90. For 90%, the number of words is 5,331. If stopwords are excluded, the number of words needed for 50% and 90% coverage increases to 888 and 11,200, respectively.
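The same counts can be read directly from the cumulative frequency tables; a small sketch equivalent to the approxfun lookup in Appendix III (coverage_count is an illustrative helper, not part of the report code):
coverage_count <- function(freq_table, pct) {
  # first rank at which the cumulative frequency reaches pct percent
  min(which(freq_table$Freqcum >= pct))
}
coverage_count(blog_freq_all, 50)   # with stopwords: ~90 words
coverage_count(blog_freq_all, 90)   # with stopwords: ~5,331 words
coverage_count(blog_freq, 50)       # without stopwords: ~888 words
coverage_count(blog_freq, 90)       # without stopwords: ~11,200 words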
Create blog_dfm_n2 (ngram = 2) and list top 20 words by frequency
# create tokens from blog sample
blog_dfm_n2 <- corpus_dfm(blog_sample, 2)
blog_dfm_n2_20 <- textstat_frequency(blog_dfm_n2, 20)
blog_dfm_n2_20 %>%
select(feature, frequency) %>%
kable(col.names = c("Word", "Frequency"), format.args = list(big.mark = ",",
scientific = FALSE)) %>%
kable_styling(full_width = F, position = "center")
| Word | Frequency |
|---|---|
| of the | 18,656 |
| in the | 15,677 |
| it is | 8,777 |
| to the | 8,550 |
| i am | 8,031 |
| on the | 7,486 |
| to be | 6,819 |
| i have | 6,789 |
| for the | 5,858 |
| and the | 5,808 |
| and i | 5,407 |
| is a | 5,371 |
| i was | 5,188 |
| it was | 4,891 |
| at the | 4,817 |
| in a | 4,558 |
| with the | 4,424 |
| that i | 4,152 |
| from the | 3,742 |
| do not | 3,635 |
Create blog_dfm_n3 (ngram = 3) and list top 20 words by frequency
# create tokens from blog sample
blog_dfm_n3 <- corpus_dfm(blog_sample, 3)
blog_dfm_n3_20 <- textstat_frequency(blog_dfm_n3, 20)
blog_dfm_n3_20 %>%
select(feature, frequency) %>%
kable(col.names = c("Word", "Frequency"), format.args = list(big.mark = ",",
scientific = FALSE)) %>%
kable_styling(full_width = F, position = "center")
| Word | Frequency |
|---|---|
| i do not | 1,397 |
| one of the | 1,393 |
| a lot of | 1,204 |
| i have been | 974 |
| it is a | 969 |
| i am not | 887 |
| i did not | 744 |
| some of the | 718 |
| as well as | 715 |
| there is a | 684 |
| to be a | 681 |
| be able to | 667 |
| it was a | 658 |
| out of the | 643 |
| the end of | 628 |
| a couple of | 620 |
| it is not | 579 |
| i want to | 577 |
| if you are | 570 |
| and i am | 553 |
News
The same process shown for the blogs file is applied to the news file (a sketch of the mirrored pipeline is shown below); only the output will be shown.
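For reference, a minimal sketch of the mirrored pipeline, assuming the same 10% sampling rate, cleaning steps, and helper functions used for the blogs sample (variable names such as news_sample are illustrative):
set.seed(1234)                                            # assumed seed
news_sample <- sample(news, round(length(news) * 0.10))   # 10% sample of the news file
news_sample <- replace_contraction(news_sample)
news_sample <- replace_hash(news_sample)
news_sample <- replace_html(news_sample)
news_sample <- replace_url(news_sample)
news_sample <- tolower(news_sample)
news_sample <- removePunctuation(news_sample, ucp = TRUE)
news_sample <- removeWords(news_sample, bad_words)
news_sample <- replace_white(news_sample)
news_dfm     <- corpus_dfm(news_sample, 1)      # unigrams, with stopwords
news_dfm_nsw <- corpus_dfm_nsw(news_sample, 1)  # unigrams, without stopwords
news_dfm_n2  <- corpus_dfm(news_sample, 2)      # bigrams
news_dfm_n3  <- corpus_dfm(news_sample, 3)      # trigrams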
List top 20 words by frequency (with and without stopwords)
| Word | Frequency |
|---|---|
| the | 196,845 |
| to | 89,485 |
| and | 88,075 |
| a | 87,699 |
| of | 76,923 |
| in | 66,968 |
| is | 40,959 |
| that | 36,700 |
| for | 35,106 |
| it | 27,410 |
| on | 26,666 |
| with | 25,150 |
| said | 25,111 |
| he | 25,015 |
| was | 23,552 |
| not | 21,290 |
| at | 21,237 |
| as | 18,620 |
| i | 18,329 |
| are | 17,097 |
| Word | Frequency |
|---|---|
| said | 25,111 |
| one | 8,315 |
| new | 6,950 |
| can | 6,883 |
| also | 5,891 |
| two | 5,839 |
| year | 5,511 |
| just | 5,316 |
| first | 5,196 |
| years | 5,172 |
| last | 5,114 |
| time | 5,053 |
| like | 4,977 |
| state | 4,826 |
| people | 4,655 |
| get | 4,434 |
| us | 4,323 |
| city | 3,707 |
| now | 3,607 |
| school | 3,534 |
Plot top 20 words for News sample with and without stopwords
Cumulative frequency plot to determine the total word count coverage at 50% and 90% of the total dictionary. In this case, the News corpus with and without stopwords is used.
From the cumulative frequency plot, the number of words that comprise 50% of the total word count is 155. For 90%, the number of words is 6,958. If stopwords are excluded, the number of words needed for 50% and 90% coverage increases to 1,128 and 12,320, respectively.
Create news_dfm_n2 (ngram = 2) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| of the | 18,657 |
| in the | 17,777 |
| to the | 8,313 |
| it is | 7,381 |
| on the | 7,308 |
| for the | 6,957 |
| at the | 5,903 |
| in a | 5,160 |
| and the | 5,125 |
| to be | 4,686 |
| is a | 4,399 |
| with the | 4,303 |
| from the | 3,663 |
| with a | 3,415 |
| he said | 3,407 |
| of a | 3,336 |
| for a | 3,159 |
| will be | 3,073 |
| as a | 3,051 |
| that is | 2,935 |
Create news_dfm_n3 (ngram = 3) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| one of the | 1,503 |
| a lot of | 1,170 |
| it is a | 967 |
| i do not | 761 |
| it is not | 712 |
| as well as | 616 |
| part of the | 576 |
| according to the | 566 |
| the end of | 552 |
| to be a | 548 |
| out of the | 542 |
| some of the | 526 |
| there is a | 522 |
| in the first | 515 |
| going to be | 511 |
| is going to | 500 |
| are going to | 442 |
| be able to | 431 |
| the united states | 428 |
| it was a | 419 |
Twitter
The same process shown for the blogs and news files is applied to the twitter file; only the output will be shown.
List top 20 words by frequency (with and without stopwords)
| Word | Frequency |
|---|---|
| the | 93,825 |
| i | 89,651 |
| to | 79,159 |
| a | 60,896 |
| you | 59,835 |
| is | 54,239 |
| and | 43,369 |
| for | 38,248 |
| it | 37,982 |
| in | 37,584 |
| of | 35,962 |
| not | 32,300 |
| my | 29,048 |
| on | 27,690 |
| that | 26,626 |
| are | 22,664 |
| have | 20,242 |
| me | 20,173 |
| at | 18,758 |
| be | 18,681 |
| Word | Frequency |
|---|---|
| just | 14,860 |
| can | 13,430 |
| like | 12,278 |
| get | 11,278 |
| love | 10,523 |
| good | 9,889 |
| day | 9,064 |
| rt | 8,985 |
| thanks | 8,737 |
| now | 8,191 |
| one | 8,180 |
| know | 7,927 |
| u | 7,732 |
| great | 7,586 |
| time | 7,544 |
| go | 7,213 |
| today | 7,036 |
| new | 6,969 |
| see | 6,665 |
| lol | 6,589 |
Plot top 20 words for Twitter sample with and without stopwords
Cumulative frequency plot to determine the total word count coverage at 50% and 90% of the total dictionary. In this case, the Twitter corpus with and without stopwords is used.
From the cumulative frequency plot, the number of words that comprise 50% of the total word count is 89. For 90%, the number of words is 3,801. If stopwords are excluded, the number of words needed for 50% and 90% coverage increases to 465 and 7,602, respectively.
Create twitter_dfm_n2 (ngram = 2) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| i am | 15,734 |
| it is | 10,109 |
| do not | 7,917 |
| in the | 7,817 |
| for the | 7,252 |
| you are | 6,120 |
| of the | 5,789 |
| i have | 5,274 |
| on the | 4,921 |
| to be | 4,691 |
| can not | 4,610 |
| that is | 4,480 |
| to the | 4,421 |
| is a | 4,417 |
| i will | 4,330 |
| thanks for | 4,147 |
| if you | 3,864 |
| at the | 3,849 |
| will be | 3,731 |
| i love | 3,513 |
Create twitter_dfm_n3 (ngram = 3) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| i do not | 2,468 |
| thanks for the | 2,275 |
| i can not | 1,502 |
| can not wait | 1,397 |
| i am not | 1,245 |
| it is a | 934 |
| i will be | 930 |
| looking forward to | 881 |
| thank you for | 878 |
| i love you | 867 |
| you do not | 845 |
| i am so | 765 |
| going to be | 756 |
| do not know | 746 |
| if you are | 731 |
| i did not | 720 |
| not wait to | 719 |
| for the follow | 701 |
| i am going | 693 |
| is going to | 684 |
Combined datasets
Combining the already cleaned and preprocessed samples from the blogs, news, and twitter datasets.
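A minimal sketch of the combination step, assuming the cleaned sample vectors from the previous sections (names other than comb_dfm_n2 are illustrative):
comb_sample  <- c(blog_sample, news_sample, twitter_sample)   # pool the cleaned samples
comb_dfm     <- corpus_dfm(comb_sample, 1)      # unigrams, with stopwords
comb_dfm_nsw <- corpus_dfm_nsw(comb_sample, 1)  # unigrams, without stopwords
comb_dfm_n2  <- corpus_dfm(comb_sample, 2)      # bigrams
comb_dfm_n3  <- corpus_dfm(comb_sample, 3)      # trigrams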
List top 20 words by frequency (with and without stopwords)
| Word | Frequency |
|---|---|
| the | 476,987 |
| to | 276,142 |
| and | 240,864 |
| a | 238,844 |
| of | 200,999 |
| i | 192,734 |
| in | 164,293 |
| is | 146,422 |
| that | 110,680 |
| for | 109,948 |
| it | 109,598 |
| you | 101,644 |
| on | 81,801 |
| not | 79,608 |
| with | 71,106 |
| was | 64,237 |
| have | 61,154 |
| my | 60,693 |
| are | 60,582 |
| at | 57,337 |
| Word | Frequency |
|---|---|
| can | 31,176 |
| said | 30,630 |
| just | 30,245 |
| one | 29,003 |
| like | 26,894 |
| get | 22,852 |
| time | 21,605 |
| new | 19,406 |
| now | 17,920 |
| good | 17,670 |
| day | 16,920 |
| us | 16,138 |
| know | 16,037 |
| love | 15,979 |
| people | 15,776 |
| back | 13,994 |
| go | 13,931 |
| see | 13,852 |
| first | 13,372 |
| also | 12,938 |
Plot top 20 words for combined samples with and without stopwords
Cumulative frequency plot to determine the total word count coverage at 50% and 90% of the total dictionary. In this case, the combined sample corpus with and without stopwords is used.
From the cumulative frequency plot, the number of words that comprise 50% of the total word count is 114. For 90%, the number of words is 6,487. If stopwords are excluded, the number of words needed for 50% and 90% coverage increases to 960 and 13,299, respectively.
Create comb_dfm_n2 (ngram = 2) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| of the | 43,102 |
| in the | 41,271 |
| it is | 26,267 |
| i am | 25,916 |
| to the | 21,284 |
| for the | 20,067 |
| on the | 19,715 |
| to be | 16,196 |
| at the | 14,569 |
| do not | 14,372 |
| is a | 14,187 |
| i have | 13,437 |
| and the | 12,426 |
| in a | 12,006 |
| with the | 10,578 |
| that is | 10,313 |
| it was | 9,965 |
| will be | 9,901 |
| you are | 9,605 |
| for a | 9,604 |
Create comb_dfm_n3 (ngram = 3) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| i do not | 4,626 |
| one of the | 3,478 |
| a lot of | 2,955 |
| it is a | 2,870 |
| i am not | 2,464 |
| thanks for the | 2,297 |
| i can not | 2,160 |
| it is not | 1,931 |
| i have been | 1,841 |
| to be a | 1,835 |
| going to be | 1,775 |
| i did not | 1,723 |
| there is a | 1,696 |
| is going to | 1,591 |
| if you are | 1,560 |
| can not wait | 1,549 |
| you do not | 1,495 |
| do not know | 1,478 |
| i will be | 1,461 |
| the end of | 1,460 |
Conclusions and next steps
When the text files (blogs, news, and twitter) are analyzed independently, the number of frequent n-grams they share increases with the n-gram size. Also, the number of unique words needed to cover each individual corpus increases significantly when stopwords are excluded. Therefore, the n-grams and word coverage of the combined sample corpus are a good starting dataset for our prediction model.
The next steps involve developing a prediction algorithm and deploying it as a Shiny app. For observed n-grams, a Markov chain approach will be used, exploiting the property that the next state depends only on the current state rather than on earlier history, which maps naturally to next-word prediction. For unobserved n-grams, backoff models will be implemented to estimate their probabilities.
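As an illustration of the intended approach (not the final implementation), a minimal backoff-style sketch over frequency tables like the ones above; predict_next, comb_freq_n2, and comb_freq_n3 are assumed names:
comb_freq_n2 <- textstat_frequency(comb_dfm_n2)   # bigram frequency table (feature, frequency)
comb_freq_n3 <- textstat_frequency(comb_dfm_n3)   # trigram frequency table
# return the most frequent continuation of the last two words,
# backing off to the bigram table when no trigram match is found
predict_next <- function(phrase) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  hits <- comb_freq_n3$feature[startsWith(comb_freq_n3$feature,
                                          paste0(paste(words, collapse = " "), " "))]
  if (length(hits) == 0 && length(words) > 1) {
    hits <- comb_freq_n2$feature[startsWith(comb_freq_n2$feature, paste0(words[2], " "))]
  }
  if (length(hits) == 0) return(NA_character_)
  tail(strsplit(hits[1], " ")[[1]], 1)   # last word of the best-matching n-gram
}
predict_next("thanks for")   # likely "the", given the tables above
A full backoff model would also discount lower-order matches; this sketch only illustrates the lookup order.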
Appendix Section
Appendix I - Function to create corpus_dfm (document-feature matrix) including stopwords and corpus_dfm_nsw excluding stopwords
corpus_dfm <- function(x, n) {
text <- dfm(x, remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE, remove_twitter = TRUE,
ngrams = n, concatenator = " ")
text <- dfm_compress(text)
text <- dfm_trim(text, min_docfreq = 2)
return(text)
}
corpus_dfm_nsw <- function(x, n) {
text <- dfm(x, remove = stopwords("english"), remove_numbers = TRUE,
remove_punct = TRUE, remove_symbols = TRUE, remove_twitter = TRUE,
remove_separators = TRUE, ngrams = n, concatenator = " ")
text <- dfm_compress(text)
text <- dfm_trim(text, min_docfreq = 2)
return(text)
}
Appendix II - Function to plot bar charts (top 20 frequency words)
# using RColor Brewer expanded for frequency term plot
colorCount = 20
getPalette = colorRampPalette(brewer.pal(12, "Paired"))
top_20_barchart <- function(df, title) {
ggplot(df, aes(x = reorder(feature, frequency), y = frequency, fill = feature)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = getPalette(colorCount)) +
labs(title = title, x = "Word", y = "Total Count") +
coord_flip() +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none")
}
Appendix III - Function to plot cumulative frequencies and determine the number of words that comprise 50% and 90% of the total corpus
cum_freq_plot <- function(df, title, subtitle) {
# function for extracting the intersection value from y
f1 <- approxfun(df$Freqcum, df$rank)
f50 <- as.integer(f1(50))
f90 <- as.integer(f1(90))
ggplot(df, aes(x = as.numeric(row.names(df)), y = Freqcum)) +
geom_line() +
geom_hline(yintercept = 50, linetype = "dashed", color = "red") +
geom_hline(yintercept = 90, linetype = "dashed", color = "blue") +
geom_vline(xintercept = f50, linetype = "dashed") +
geom_text(aes(x = f50, y = 10, label = f50),
color = "red", angle = 90, size = 3, nudge_x = -0.05) +
geom_vline(xintercept = f90, linetype = "dashed") +
geom_text(aes(x = f90, y = 10, label = f90),
color = "blue", angle = 90, size = 3, nudge_x = -0.05) +
scale_x_log10(limits = c(1, nrow(df)),
breaks = c(50, 100, 5000, 10000)) +
scale_y_log10(breaks = c(25, 50, 80, 90, 95)) +
labs(title = title, subtitle = subtitle,
x = "Number of Words",
y = "Cumulative Frequency (%)")
}