Load Data

Load sample lines from en_US.twitter.txt

file_path <- "./Coursera-SwiftKey/final/en_US/"
con <- file(paste0(file_path, "en_US.twitter.txt"), "r") 

readLines(con, 1) ## Read the first line of text 
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
readLines(con, 1) ## Read the next line of text 
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
readLines(con, 5) ## Read in the next 5 lines of text 
## [1] "they've decided its more fun if I don't."                                                             
## [2] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                 
## [3] "Words from a complete stranger! Made my birthday even better :)"                                      
## [4] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"                        
## [5] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
close(con) ## It's important to close the connection when you are done. 

Data Exploration

Tokenization

Tokenization breaks each line into smaller pieces and identifies the appropriate tokens, such as words, punctuation, and numbers. Note that symbols and numbers are removed automatically.
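The report does not show its tokenizer, so the following is only a minimal base R sketch of the idea. The helper name `tokenize_line` and the exact cleaning rules (lowercase everything, keep letters and apostrophes, drop digits and symbols) are assumptions chosen to resemble the tokens shown in the tables, not the method actually used.

```r
# Hypothetical helper, not taken from the original report.
# Lowercases a line, replaces anything that is not a letter, apostrophe,
# or space with a space, then splits on runs of spaces.
tokenize_line <- function(line) {
  line <- tolower(line)
  line <- gsub("[^a-z' ]+", " ", line)   # drops digits, punctuation, symbols
  tokens <- strsplit(line, " +")[[1]]
  tokens[nzchar(tokens)]                  # remove empty strings from leading spaces
}

tokens <- tokenize_line("How are you? Btw thanks for the RT.")
print(tokens)
```

Keeping the apostrophe means contractions such as "you'll" stay as single tokens. Curly apostrophes would need an extra normalization step that this sketch leaves out.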

Basic Information about the corpus

Word counts and Line counts

## [1] "There are 29633678 words and 2360148 lines in the corpus."
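A summary line like the one above could be produced along these lines. This is a sketch under assumptions: `count_corpus` is a hypothetical helper, the cleaning rule is the same simplified one (letters and apostrophes only), and the two sample lines stand in for the full corpus.

```r
# Hypothetical helper, not taken from the original report.
# Counts the lines and the word tokens in a character vector of lines.
count_corpus <- function(lines) {
  cleaned <- trimws(gsub("[^a-z' ]+", " ", tolower(lines)))
  tokens_per_line <- lengths(strsplit(cleaned, " +"))  # 0 for empty lines
  list(words = sum(tokens_per_line), lines = length(lines))
}

sample_lines <- c("How are you? Btw thanks for the RT.", "Go Cubs Go!")
counts <- count_corpus(sample_lines)
message(sprintf("There are %d words and %d lines in the corpus.",
                counts$words, counts$lines))
```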

Rankings and Histograms

Words Frequency: Any position in a line

First, let’s take a look at the frequency of single words.
Top 10 Words Frequency: Any position in a line
ngram   freq
the     937467
to      788663
i       723548
a       611407
you     548164
and     438541
for     385357
in      380383
of      359636
is      358787

From the histogram, we can see that the frequency distribution is heavily skewed and spans a wide range: even within the top 10, counts fall from 937467 for "the" to 358787 for "is".
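A ranking like this can be built with `table()` and `sort()` in base R. The sketch below assumes the tokens are already available as a character vector; `top_ngrams` is a hypothetical helper name, and the tiny token vector is illustrative only.

```r
# Hypothetical helper, not taken from the original report.
# Returns the k most frequent entries as a data frame with the same
# column names as the tables in this report (ngram, freq).
top_ngrams <- function(tokens, k = 10) {
  freq <- sort(table(tokens), decreasing = TRUE)
  out <- data.frame(ngram = names(freq),
                    freq = as.integer(freq),
                    stringsAsFactors = FALSE)
  head(out, k)
}

tokens <- c("the", "to", "the", "i", "the", "to")
top <- top_ngrams(tokens, k = 2)
print(top)
```

Plotting `top$freq` with `barplot()` or `hist()` would then show the skew described above.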

Words Frequency: First one in a line

Now let’s take a look at the frequency of the first word in each line.

Top 10 Words Frequency: First one in a line
first_word   freq
i            207633
thanks       52266
the          44893
you          40155
i’m          38130
rt           36509
just         29925
my           27231
if           26255
what         25669

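We can see that the frequency distribution is heavily skewed and spans a wide range: "i" (207633) opens a line about four times as often as the runner-up, "thanks" (52266).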
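First-word counts can be obtained by taking only the first token of each line before tabulating. The following is a sketch, assuming the same simplified cleaning rule (lowercase, letters and apostrophes only); `first_word` is a hypothetical helper and the sample lines are made up for illustration.

```r
# Hypothetical helper, not taken from the original report.
# Returns the first token of a line, or NA if the line has no word tokens.
first_word <- function(line) {
  cleaned <- trimws(gsub("[^a-z' ]+", " ", tolower(line)))
  tokens <- strsplit(cleaned, " +")[[1]]
  if (length(tokens) == 0) NA_character_ else tokens[1]
}

sample_lines <- c("I love this", "i'm here", "I no!", "Thanks for the RT", "123")
first_words <- vapply(sample_lines, first_word, character(1), USE.NAMES = FALSE)

# table() drops NA values by default, so lines without words are ignored.
first_freq <- sort(table(first_words), decreasing = TRUE)
print(first_freq)
```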

2-grams Frequency

Now let’s take a look at the frequency of 2-word phrases.
Top 10 2-grams Frequency
ngram        freq
in the       78457
for the      73969
of the       56960
on the       48542
to be        47097
to the       43440
thanks for   43007
at the       37245
i love       35925
going to     34277

From the histogram, we can see that the frequency distribution is heavily skewed and spans a wide range.
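An n-gram is a run of n consecutive tokens, so 2-grams can be generated by sliding a window of width 2 over each line's tokens. The sketch below uses a hypothetical helper, `make_ngrams`, and assumes n-grams are built per line so that they never span two tweets; the report does not say whether it did the same.

```r
# Hypothetical helper, not taken from the original report.
# Builds all n-grams (joined by a single space) from one line's tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))   # line too short for an n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("thanks", "for", "the", "follow")
bigrams <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

The same function covers 3-word phrases by passing `n = 3`. The resulting vectors can be tabulated with `table()` in the same way as single words.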

3-grams Frequency

Now let’s take a look at the frequency of 3-word phrases.
Top 10 3-grams Frequency
ngram                freq
thanks for the       23621
looking forward to   8833
thank you for        8687
i love you           8423
for the follow       7932
going to be          7416
can’t wait to        7346
i want to            7118
a lot of             6250
to be a              5995

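From the histogram, we can see that the frequency distribution is heavily skewed and spans a wide range: "thanks for the" (23621) occurs more than twice as often as the next phrase, "looking forward to" (8833).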

Next Steps

The next step is to build models that predict which word is going to be typed, so that the models can answer the questions below:
(1) In a blank input box, predict which words are most likely to be typed first, based on the frequency of the first word in a line;
(2) After the user has input one word, predict which word will be typed next, based on the frequency of the 2-grams;
(3) After the user has input two words, predict which word will be typed next, based on the frequency of the 3-grams.
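Cases (2) and (3) amount to a lookup: find the n-grams that begin with what the user has typed and return the most frequent continuations. The sketch below is one simple way to do that, not the model the report will build. `predict_next` is a hypothetical helper, and the small data frame reuses a few 2-gram counts from the table in this report.

```r
# Hypothetical helper, not taken from the original report.
# Given the typed prefix and a data frame of n-grams (columns ngram, freq),
# returns up to k continuations, ordered from most to least frequent.
predict_next <- function(prefix, ngram_freq, k = 3) {
  key <- paste0(prefix, " ")                        # prefix must be followed by a space
  hits <- ngram_freq[startsWith(ngram_freq$ngram, key), ]
  hits <- hits[order(-hits$freq), ]
  head(substring(hits$ngram, nchar(key) + 1), k)    # text after the prefix
}

bigram_freq <- data.frame(
  ngram = c("in the", "to be", "to the", "thanks for", "going to"),
  freq = c(78457, 47097, 43440, 43007, 34277),
  stringsAsFactors = FALSE
)

next_words <- predict_next("to", bigram_freq)
print(next_words)
```

For case (3), the same function would be called with a two-word prefix and a 3-gram table. Case (1) needs no lookup, since it only returns the top rows of the first-word ranking. The function returns an empty vector when no n-gram matches, so a real model would need a fallback for unseen prefixes.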