Load sample lines from en_US.twitter.txt
file_path <- "./Coursera-SwiftKey/final/en_US/"
con <- file(paste0(file_path, "en_US.twitter.txt"), "r")
readLines(con, 1) ## Read the first line of text
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
readLines(con, 1) ## Read the next line of text
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
readLines(con, 5) ## Read in the next 5 lines of text
## [1] "they've decided its more fun if I don't."
## [2] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [3] "Words from a complete stranger! Made my birthday even better :)"
## [4] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## [5] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
close(con) ## It's important to close the connection when you are done.
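One way to make sure the connection is always closed, even when reading fails partway, is to wrap the read in a small function and register the `close()` call with `on.exit()`. The sketch below is only an illustration: `read_first_lines` is a hypothetical helper name, and a temporary file stands in for the corpus so that the example runs on its own.

```r
# Hypothetical helper (not part of the original report): read the first n
# lines of a text file and close the connection no matter what happens.
read_first_lines <- function(path, n = 5) {
  con <- file(path, "r")
  on.exit(close(con))  # runs when the function exits, even after an error
  readLines(con, n)
}

# A temporary file stands in for en_US.twitter.txt here.
tmp <- tempfile(fileext = ".txt")
writeLines(c("line one", "line two", "line three"), tmp)
read_first_lines(tmp, 2)
```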
Tokenization breaks each line into smaller pieces and identifies the appropriate tokens, such as words, punctuation, and numbers. Note that symbols and numbers are removed automatically during this step.
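The report does not show which tokenizer was used, so the following is only a minimal base-R sketch of the idea. It assumes that tokens are lowercase words, that straight apostrophes are kept, and that every other symbol or digit is treated as a separator. The name `tokenize_line` is hypothetical.

```r
# Hypothetical tokenizer: an assumption about the cleaning rules, not the
# report's actual implementation.
tokenize_line <- function(line) {
  line <- tolower(line)
  # Replace every run of characters that is not a letter, a straight
  # apostrophe, or a space with a single space (drops symbols and digits).
  line <- gsub("[^a-z' ]+", " ", line)
  tokens <- strsplit(line, " +")[[1]]
  tokens[nzchar(tokens)]  # drop the empty string left by a leading space
}

tokenize_line("First Cubs game ever! Wrigley field is gorgeous.")
```

Under these assumptions, digits such as the `5` in "Like In 5 Minutes" disappear from the token list, which matches the note that numbers are removed.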
## [1] "There are 29633678 words and 2360148 lines in the corpus."
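A summary line like the one above can be produced by counting tokens per line and summing them. The sketch below runs on three of the sample tweets shown earlier rather than on the full corpus, and it assumes a simple cleaning rule (lowercase, keep letters and straight apostrophes only), which may differ from the rule used for the reported figures.

```r
# Toy stand-in for the corpus: three of the sample lines printed earlier.
corpus <- c(
  "they've decided its more fun if I don't.",
  "Words from a complete stranger! Made my birthday even better :)",
  "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
)

# Assumed cleaning rule: lowercase, then turn anything that is not a letter,
# a straight apostrophe, or a space into a space.
clean  <- gsub("[^a-z' ]+", " ", tolower(corpus))
tokens <- strsplit(trimws(clean), " +")  # one character vector per line

n_lines <- length(corpus)
n_words <- sum(lengths(tokens))

paste("There are", n_words, "words and", n_lines, "lines in the corpus.")
```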
| ngram | freq |
|---|---|
| the | 937467 |
| to | 788663 |
| i | 723548 |
| a | 611407 |
| you | 548164 |
| and | 438541 |
| for | 385357 |
| in | 380383 |
| of | 359636 |
| is | 358787 |
From the histogram, we can see that the word frequency distribution is heavily skewed and spans a wide range.
Now let’s take a look at the frequency of the first word in each line.
| first_word | freq |
|---|---|
| i | 207633 |
| thanks | 52266 |
| the | 44893 |
| you | 40155 |
| i’m | 38130 |
| rt | 36509 |
| just | 29925 |
| my | 27231 |
| if | 26255 |
| what | 25669 |
We can see that the first-word frequency distribution is also heavily skewed and spans a wide range.
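A first-word frequency table of this shape can be sketched in base R as follows. The five input lines are made-up toy data, and `first_word` is a hypothetical helper that assumes the same kind of cleaning as elsewhere (lowercase, letters and straight apostrophes only).

```r
# Made-up toy lines, only to make the sketch runnable.
corpus <- c(
  "Thanks for the RT.",
  "I love this game!",
  "thanks for the follow",
  "I want to sleep",
  "i no!"
)

# Hypothetical helper: return the first cleaned token of a line,
# or NA when the line contains no token at all.
first_word <- function(line) {
  clean <- trimws(gsub("[^a-z' ]+", " ", tolower(line)))
  strsplit(clean, " +")[[1]][1]
}

first_words <- vapply(corpus, first_word, character(1), USE.NAMES = FALSE)

# table() ignores NA by default, so empty lines do not get counted.
first_freq <- sort(table(first_words), decreasing = TRUE)
first_word_table <- data.frame(
  first_word = names(first_freq),
  freq = as.integer(first_freq),
  stringsAsFactors = FALSE
)
first_word_table
```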
| ngram | freq |
|---|---|
| in the | 78457 |
| for the | 73969 |
| of the | 56960 |
| on the | 48542 |
| to be | 47097 |
| to the | 43440 |
| thanks for | 43007 |
| at the | 37245 |
| i love | 35925 |
| going to | 34277 |
From the histogram, we can see that the 2-gram frequency distribution is heavily skewed and spans a wide range.
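To make the n-gram counting concrete, here is a small base-R sketch that slides a window of `n` tokens over each line and tabulates the results. It assumes n-grams never cross line boundaries, which the report does not state. `make_ngrams` is a hypothetical helper and the token lists are toy data.

```r
# Hypothetical helper: all n-grams of one tokenized line, joined by spaces.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))  # line too short for an n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Toy tokenized lines (already lowercased and cleaned).
tokens_by_line <- list(
  c("thanks", "for", "the", "follow"),
  c("thanks", "for", "the", "rt"),
  c("looking", "forward", "to", "the", "game")
)

# n-grams are built per line, so none of them spans two lines.
bigrams <- unlist(lapply(tokens_by_line, make_ngrams, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 3)
```

Calling the same helper with `n = 3` would give the 3-gram counts in the same way.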
| ngram | freq |
|---|---|
| thanks for the | 23621 |
| looking forward to | 8833 |
| thank you for | 8687 |
| i love you | 8423 |
| for the follow | 7932 |
| going to be | 7416 |
| can’t wait to | 7346 |
| i want to | 7118 |
| a lot of | 6250 |
| to be a | 5995 |
From the histogram, we can see that the 3-gram frequency distribution is heavily skewed and spans a wide range.
The plan is to build models that predict which word is going to be typed,
so that the models can handle the following cases:
(1) In a blank input box, predict which words are most likely to be
typed first, based on the frequency of the first word in each line of
the corpus;
(2) After the user has input one word, predict which word will be typed
next, based on the frequency of the 2-grams we have;
(3) After the user has input two words, predict which word will be typed
next, based on the frequency of the 3-grams we have.
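The three cases above could be sketched as a single lookup function over the frequency tables. This is only an illustration under several assumptions: the tables below hold a few rows copied from the tables in this report, `predict_next` is a hypothetical name, inputs longer than two words are handled by using only the last two words, and no fallback is attempted when the context is not found.

```r
# A few rows copied from the frequency tables in this report.
first_word_freq <- data.frame(
  first_word = c("i", "thanks", "the"),
  freq = c(207633, 52266, 44893),
  stringsAsFactors = FALSE
)
bigram_freq <- data.frame(
  ngram = c("in the", "for the", "of the", "thanks for", "i love"),
  freq = c(78457, 73969, 56960, 43007, 35925),
  stringsAsFactors = FALSE
)
trigram_freq <- data.frame(
  ngram = c("thanks for the", "thank you for", "i love you", "for the follow"),
  freq = c(23621, 8687, 8423, 7932),
  stringsAsFactors = FALSE
)

# Hypothetical predictor: returns up to k candidate next words.
predict_next <- function(input, k = 3) {
  words <- strsplit(trimws(tolower(input)), " +")[[1]]
  words <- words[nzchar(words)]

  # Case (1): blank input, so rank by first-word frequency.
  if (length(words) == 0) {
    top <- first_word_freq[order(-first_word_freq$freq), ]
    return(head(top$first_word, k))
  }

  # Case (2) uses the 2-gram table, case (3) the 3-gram table.
  # Assumption: for longer inputs only the last two words are used.
  context <- tail(words, min(length(words), 2))
  tbl <- if (length(context) == 2) trigram_freq else bigram_freq

  # The trailing space stops "i " from matching "in the".
  prefix <- paste0(paste(context, collapse = " "), " ")
  hits <- tbl[startsWith(tbl$ngram, prefix), ]
  hits <- hits[order(-hits$freq), ]
  head(sub(prefix, "", hits$ngram, fixed = TRUE), k)
}

predict_next("")            # case (1)
predict_next("thanks")      # case (2)
predict_next("thanks for")  # case (3)
```

When the context does not appear in the relevant table, this sketch returns an empty character vector. How to handle such unseen contexts is left open here.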