Introduction

The goal of this milestone project is to demonstrate familiarity with the data and to show that we are on track to create the prediction algorithm. The contents of this report are:

  1. Downloading the data from the link provided and loading it in.

  2. Exploratory analysis: data cleaning and preprocessing, summary statistics, and descriptive plots for each file (blogs, news, twitter) and for the combined sample.

  3. Findings about the data, such as the most frequent words (one-, two-, and three-word sequences) and the number of unique words needed to cover 50% and 90% of the corpus.

  4. Next steps toward creating the prediction algorithm.

Reading Text Files
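
A minimal sketch of the reading step, assuming the Coursera SwiftKey files have been downloaded and unzipped under ./final/en_US/ (the path is an assumption):

```r
blogs_raw   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news_raw    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter_raw <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```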

Exploratory Analysis

Summary of descriptive statistics for each collection (blogs, news and tweets)

|                        | Blogs      | News       | Twitter    |
|------------------------|------------|------------|------------|
| Line count             | 899,288    | 1,010,242  | 2,360,148  |
| Word count             | 37,570,839 | 34,494,539 | 30,451,170 |
| Min. words per line    | 0          | 1          | 1          |
| Average words per line | 41.8       | 34.4       | 12.8       |
| Max. words per line    | 6,726      | 1,796      | 47         |
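
A sketch of how these statistics could be computed from the raw vectors read above:

```r
library(stringi)

summarize_lines <- function(x) {
  wpl <- stri_count_words(x)   # words per line
  c(lines = length(x), words = sum(wpl),
    min = min(wpl), mean = round(mean(wpl), 1), max = max(wpl))
}

sapply(list(Blogs = blogs_raw, News = news_raw, Twitter = twitter_raw),
       summarize_lines)
```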

Cleaning the data

The quanteda package is used to convert all text to lower case, replace special characters between words (hyphens, apostrophes, slashes, etc.) with spaces, and remove punctuation (periods, question marks, exclamation points). The profanity (“bad words”) list is downloaded from this link. A separate dataset with “stopwords” removed is created to compare word frequencies against the original datasets, which keep them.
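
A minimal sketch of this cleaning step, assuming the profanity list has been read into a character vector named profanity (a placeholder):

```r
library(quanteda)

# Tokenize and normalize; options mirror the cleaning described above
clean_tokens <- function(x, remove_stopwords = FALSE) {
  toks <- tokens(x,
                 remove_punct   = TRUE,  # periods, question marks, exclamation points
                 remove_symbols = TRUE,  # special characters
                 remove_numbers = TRUE,
                 remove_url     = TRUE)
  toks <- tokens_tolower(toks)
  toks <- tokens_remove(toks, profanity)           # profanity filter (placeholder vector)
  if (remove_stopwords)
    toks <- tokens_remove(toks, stopwords("en"))   # optional stopword removal
  toks
}
```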

Blogs

Sample a fraction (10%) of the complete file to speed up the processing time.
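
A sketch of the sampling, assuming blogs_raw holds the full file:

```r
set.seed(1234)  # arbitrary seed for a reproducible sample
blogs_sample <- sample(blogs_raw, round(length(blogs_raw) * 0.10))
```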

Clean blogs_sample with the textclean and tm libraries.
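
A representative pass with those libraries; the exact calls are in the Appendix, so treat these as illustrative choices:

```r
library(textclean)
library(tm)

blogs_sample <- replace_non_ascii(blogs_sample)    # curly quotes, emoji, etc.
blogs_sample <- replace_contraction(blogs_sample)  # expand contractions; the "do not"/"can not"
                                                   # bigrams below suggest this step
blogs_sample <- removeNumbers(blogs_sample)
blogs_sample <- stripWhitespace(blogs_sample)
```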

Create blog_dfm (document-feature matrix) and list the top 20 words by frequency using the corpus_dfm and corpus_dfm_nsw functions (see Appendix I). The frequency tables illustrate word frequency with and without stopwords. Stopwords are common words, such as ‘a’, ‘the’, and ‘and’, that generally are not indexed or searchable in a search engine (source: Collins Dictionary).
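
What corpus_dfm and corpus_dfm_nsw are assumed to reduce to (see Appendix I for the actual code):

```r
blog_dfm     <- dfm(clean_tokens(blogs_sample))
blog_dfm_nsw <- dfm(clean_tokens(blogs_sample, remove_stopwords = TRUE))

topfeatures(blog_dfm, 20)       # with stopwords
topfeatures(blog_dfm_nsw, 20)   # without stopwords
```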

Top 20 Words with Stopwords
Word Frequency
the 186,317
and 109,420
to 107,498
a 90,249
of 88,114
i 84,754
in 59,741
is 51,224
that 47,354
it 44,206
for 36,594
you 30,574
with 28,771
was 28,464
my 27,482
on 27,445
this 26,021
not 26,018
have 24,765
as 22,361
Top 20 Words without Stopwords
Word Frequency
one 12,508
can 10,863
just 10,069
like 9,639
time 9,008
get 7,140
now 6,122
people 6,092
know 5,874
us 5,518
also 5,488
new 5,487
even 5,237
day 5,073
see 5,059
first 5,048
really 5,047
back 5,007
make 4,959
well 4,953



Plot top 20 words for Blog sample with and without stopwords (see Appendix II for function code)
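
A sketch of the plotting helper (the actual function is in Appendix II):

```r
library(ggplot2)

# Horizontal bar chart of the n most frequent features in a dfm
plot_top_words <- function(x, n = 20, title = "") {
  freq <- topfeatures(x, n)
  df   <- data.frame(word = names(freq), frequency = unname(freq))
  ggplot(df, aes(reorder(word, frequency), frequency)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}

plot_top_words(blog_dfm,     title = "Top 20 Words with Stopwords")
plot_top_words(blog_dfm_nsw, title = "Top 20 Words without Stopwords")
```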

Cumulative frequency plots are used to determine how many unique words cover 50% and 90% of the total word count (see Appendix III). In this case, the Blog corpus is analyzed both with and without stopwords.

From the cumulative frequency plot, 90 unique words cover 50% of the total word count, and 5,331 words cover 90%. If stopwords are excluded, the number of words required for 50% and 90% coverage increases to 888 and 11,200, respectively.
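
A sketch of the coverage calculation behind these numbers (cf. Appendix III):

```r
# Number of most-frequent words needed to cover a given share of all tokens
words_for_coverage <- function(x, coverage) {
  freq <- topfeatures(x, nfeat(x))   # all features, sorted by frequency
  which(cumsum(freq) / sum(freq) >= coverage)[1]
}

words_for_coverage(blog_dfm, 0.50)   # 90 words, as reported above
words_for_coverage(blog_dfm, 0.90)   # 5,331 words
```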

Create blog_dfm_n2 (ngram = 2) and list the top 20 words by frequency
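
A sketch of the bigram construction; the trigram table further below would use n = 3:

```r
blog_toks   <- clean_tokens(blogs_sample)
blog_dfm_n2 <- dfm(tokens_ngrams(blog_toks, n = 2, concatenator = " "))
topfeatures(blog_dfm_n2, 20)
```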

Word Frequency
of the 18,656
in the 15,677
it is 8,777
to the 8,550
i am 8,031
on the 7,486
to be 6,819
i have 6,789
for the 5,858
and the 5,808
and i 5,407
is a 5,371
i was 5,188
it was 4,891
at the 4,817
in a 4,558
with the 4,424
that i 4,152
from the 3,742
do not 3,635

Create blog_dfm_n3 (ngram = 3) the same way and list the top 20 words by frequency

Word Frequency
i do not 1,397
one of the 1,393
a lot of 1,204
i have been 974
it is a 969
i am not 887
i did not 744
some of the 718
as well as 715
there is a 684
to be a 681
be able to 667
it was a 658
out of the 643
the end of 628
a couple of 620
it is not 579
i want to 577
if you are 570
and i am 553

News

The same process shown for the blogs file is applied to the news file; only the output is shown.

List top 20 words by frequency (with and without stopwords)
Top 20 Words with Stopwords
Word Frequency
the 196,845
to 89,485
and 88,075
a 87,699
of 76,923
in 66,968
is 40,959
that 36,700
for 35,106
it 27,410
on 26,666
with 25,150
said 25,111
he 25,015
was 23,552
not 21,290
at 21,237
as 18,620
i 18,329
are 17,097
Top 20 Words without Stopwords
Word Frequency
said 25,111
one 8,315
new 6,950
can 6,883
also 5,891
two 5,839
year 5,511
just 5,316
first 5,196
years 5,172
last 5,114
time 5,053
like 4,977
state 4,826
people 4,655
get 4,434
us 4,323
city 3,707
now 3,607
school 3,534



Plot top 20 words for News sample with and without stopwords

Cumulative frequency plots are used to determine how many unique words cover 50% and 90% of the total word count. In this case, the News corpus is analyzed both with and without stopwords.

From the cumulative frequency plot, 155 unique words cover 50% of the total word count, and 6,958 words cover 90%. If stopwords are excluded, the number of words required for 50% and 90% coverage increases to 1,128 and 12,320, respectively.

Create news_dfm_n2 (ngram = 2) and list the top 20 features by frequency
Word Frequency
of the 18,657
in the 17,777
to the 8,313
it is 7,381
on the 7,308
for the 6,957
at the 5,903
in a 5,160
and the 5,125
to be 4,686
is a 4,399
with the 4,303
from the 3,663
with a 3,415
he said 3,407
of a 3,336
for a 3,159
will be 3,073
as a 3,051
that is 2,935
Create news_dfm_n3 (ngram = 3) and list the top 20 features by frequency
Word Frequency
one of the 1,503
a lot of 1,170
it is a 967
i do not 761
it is not 712
as well as 616
part of the 576
according to the 566
the end of 552
to be a 548
out of the 542
some of the 526
there is a 522
in the first 515
going to be 511
is going to 500
are going to 442
be able to 431
the united states 428
it was a 419

Twitter

The same process shown for the blogs and news files is applied to the twitter file; only the output is shown.

List top 20 words by frequency (with and without stopwords)
Top 20 Words with Stopwords
Word Frequency
the 93,825
i 89,651
to 79,159
a 60,896
you 59,835
is 54,239
and 43,369
for 38,248
it 37,982
in 37,584
of 35,962
not 32,300
my 29,048
on 27,690
that 26,626
are 22,664
have 20,242
me 20,173
at 18,758
be 18,681
Top 20 Words without Stopwords
Word Frequency
just 14,860
can 13,430
like 12,278
get 11,278
love 10,523
good 9,889
day 9,064
rt 8,985
thanks 8,737
now 8,191
one 8,180
know 7,927
u 7,732
great 7,586
time 7,544
go 7,213
today 7,036
new 6,969
see 6,665
lol 6,589



Plot top 20 words for Twitter sample with and without stopwords

Cumulative frequency plots are used to determine how many unique words cover 50% and 90% of the total word count. In this case, the Twitter corpus is analyzed both with and without stopwords.

From the cumulative frequency plot, 89 unique words cover 50% of the total word count, and 3,801 words cover 90%. If stopwords are excluded, the number of words required for 50% and 90% coverage increases to 465 and 7,602, respectively.

Create twitter_dfm_n2 (ngram = 2) and list the top 20 words by frequency
Word Frequency
i am 15,734
it is 10,109
do not 7,917
in the 7,817
for the 7,252
you are 6,120
of the 5,789
i have 5,274
on the 4,921
to be 4,691
can not 4,610
that is 4,480
to the 4,421
is a 4,417
i will 4,330
thanks for 4,147
if you 3,864
at the 3,849
will be 3,731
i love 3,513
Create twitter_dfm_n3 (ngram = 3) and list the top 20 words by frequency
Word Frequency
i do not 2,468
thanks for the 2,275
i can not 1,502
can not wait 1,397
i am not 1,245
it is a 934
i will be 930
looking forward to 881
thank you for 878
i love you 867
you do not 845
i am so 765
going to be 756
do not know 746
if you are 731
i did not 720
not wait to 719
for the follow 701
i am going 693
is going to 684

Combined datasets

The cleaned and preprocessed samples from the blogs, news, and twitter datasets are combined.
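
A sketch of the combination, assuming news_sample and twitter_sample exist by analogy with blogs_sample:

```r
comb_sample <- c(blogs_sample, news_sample, twitter_sample)
comb_dfm    <- dfm(clean_tokens(comb_sample))
topfeatures(comb_dfm, 20)
```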

List top 20 words by frequency (with and without stopwords)
Top 20 Words with Stopwords
Word Frequency
the 476,987
to 276,142
and 240,864
a 238,844
of 200,999
i 192,734
in 164,293
is 146,422
that 110,680
for 109,948
it 109,598
you 101,644
on 81,801
not 79,608
with 71,106
was 64,237
have 61,154
my 60,693
are 60,582
at 57,337
Top 20 Words without Stopwords
Word Frequency
can 31,176
said 30,630
just 30,245
one 29,003
like 26,894
get 22,852
time 21,605
new 19,406
now 17,920
good 17,670
day 16,920
us 16,138
know 16,037
love 15,979
people 15,776
back 13,994
go 13,931
see 13,852
first 13,372
also 12,938



Plot top 20 words for combined samples with and without stopwords

Cumulative frequency plots are used to determine how many unique words cover 50% and 90% of the total word count. In this case, the combined samples are analyzed both with and without stopwords.

From the cumulative frequency plot, 114 unique words cover 50% of the total word count, and 6,487 words cover 90%. If stopwords are excluded, the number of words required for 50% and 90% coverage increases to 960 and 13,299, respectively.

Create comb_dfm_n2 (ngram = 2) and list the top 20 words by frequency
Word Frequency
of the 43,102
in the 41,271
it is 26,267
i am 25,916
to the 21,284
for the 20,067
on the 19,715
to be 16,196
at the 14,569
do not 14,372
is a 14,187
i have 13,437
and the 12,426
in a 12,006
with the 10,578
that is 10,313
it was 9,965
will be 9,901
you are 9,605
for a 9,604
Create comb_dfm_n3 (ngram = 3) and list the top 20 words by frequency
Word Frequency
i do not 4,626
one of the 3,478
a lot of 2,955
it is a 2,870
i am not 2,464
thanks for the 2,297
i can not 2,160
it is not 1,931
i have been 1,841
to be a 1,835
going to be 1,775
i did not 1,723
there is a 1,696
is going to 1,591
if you are 1,560
can not wait 1,549
you do not 1,495
do not know 1,478
i will be 1,461
the end of 1,460

Conclusions and next steps

When the text files (blogs, news, and twitter) are analyzed independently, the number of top n-grams they share increases with the n-gram size. Also, the number of unique words needed to cover a given share of each corpus increases significantly when stopwords are excluded. Therefore, the combined-sample n-grams and their word coverage form a good starting dataset for the prediction model.

The next steps are to develop the prediction algorithm and deploy it as a Shiny app. For observed n-grams, a Markov chain model will be used, exploiting the property that the next state depends only on the current state, not on past events; this maps naturally to predicting the next word from the preceding words. For unobserved n-grams, backoff models will be implemented to estimate the probability of such events.
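
A rough sketch of the planned lookup, assuming the n-gram frequency tables above are stored as data frames with word and frequency columns (all names are placeholders, not a final design):

```r
predict_next <- function(prefix, trigrams, bigrams) {
  # Markov step: condition only on the current state (the last two words)
  hits <- trigrams[startsWith(trigrams$word, paste0(prefix, " ")), ]
  if (nrow(hits) > 0)
    return(sub(".* ", "", hits$word[which.max(hits$frequency)]))
  # Backoff: fall back to bigrams keyed on the last word of the prefix
  last <- sub(".* ", "", prefix)
  hits <- bigrams[startsWith(bigrams$word, paste0(last, " ")), ]
  if (nrow(hits) > 0)
    return(sub(".* ", "", hits$word[which.max(hits$frequency)]))
  NA_character_   # unseen at every order
}

# e.g. predict_next("do not", trigrams, bigrams) could return "know"
```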

Appendix Section