Load Data

Load sample lines from en_US.twitter.txt

file_path <- "./Coursera-SwiftKey/final/en_US/"
con <- file(paste0(file_path, "en_US.twitter.txt"), "r") 

readLines(con, 1) ## Read the first line of text

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

readLines(con, 1) ## Read the next line of text

## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."

readLines(con, 5) ## Read in the next 5 lines of text

## [1] "they've decided its more fun if I don't."                                                             
## [2] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                 
## [3] "Words from a complete stranger! Made my birthday even better :)"                                      
## [4] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"                        
## [5] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"

close(con) ## It's important to close the connection when you are done.
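Closing the connection by hand is easy to forget when a read fails part-way through. As a minimal sketch (the helper name `read_head` and the temporary demo file are assumptions, not part of the original report), the open/read/close steps can be wrapped in a function so that `on.exit()` always closes the connection:

```r
# Hypothetical helper: read the first n lines of a text file and always
# close the connection, even if readLines() throws an error.
read_head <- function(path, n = 5) {
  con <- file(path, "r")
  on.exit(close(con))
  readLines(con, n, skipNul = TRUE)
}

# Demo on a small temporary file rather than the real Twitter corpus.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
head_lines <- read_head(tmp, 2)
print(head_lines)
```

The same call should work on the Twitter file by passing `paste0(file_path, "en_US.twitter.txt")` as `path`.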

Data Exploration

Tokenization

Tokenization breaks each line into smaller pieces and identifies the appropriate tokens, such as words, punctuation, and numbers. Note that symbols and numbers are removed automatically.
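The report does not show which tokenizer was used, so the following is only a base-R stand-in that illustrates the idea. The function name `tokenize_line` and the exact cleaning rules (lower-casing, keeping only letters and apostrophes) are assumptions.

```r
# Hypothetical tokenizer: lower-case the line, replace every character that
# is not a letter, an apostrophe or a space with a space, then split.
tokenize_line <- function(line) {
  line <- tolower(line)
  line <- gsub("[^a-z' ]", " ", line)
  tokens <- strsplit(line, " +")[[1]]
  tokens[tokens != ""]  # drop the empty token left by a leading space
}

tokens <- tokenize_line("First Cubs game ever! Go Cubs Go! 5 times")
print(tokens)
```

Note that this simple rule also drops non-ASCII letters, which a real tokenizer would normally keep.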

Basic Information about the corpus

Word counts and Line counts

## [1] "There are 29633678 words and 2360148 lines in the corpus."
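A summary line of this form could be produced roughly as follows. This is a sketch on two sample lines, not the code behind the figures above; the variable names and the cleaning rule are assumptions.

```r
# Two sample lines standing in for the full corpus (assumption).
sample_lines <- c("How are you? Btw thanks for the RT.", "Go Cubs Go!")

# Lower-case, blank out everything except letters, apostrophes and spaces,
# then trim so that splitting does not create empty tokens.
clean  <- trimws(gsub("[^a-z' ]", " ", tolower(sample_lines)))
tokens <- strsplit(clean, " +")

n_words <- sum(lengths(tokens))   # total tokens across all lines
n_lines <- length(sample_lines)   # one element per line

sprintf("There are %d words and %d lines in the corpus.", n_words, n_lines)
```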

Rankings and Histograms

Words Frequency: Any position in a line

First, let’s take a look at the frequency of single words.

Top 10 Words Frequency: Any position in a line
ngram	freq
the	937467
to	788663
i	723548
a	611407
you	548164
and	438541
for	385357
in	380383
of	359636
is	358787

From the histogram, we can see that the frequency distribution is highly skewed and spans a wide range: even within the top 10, the count falls from 937,467 ("the") to 358,787 ("is").
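A ranking like the one above can be built with `table()` and `sort()`. The sketch below runs on a toy token vector (an assumption) and adds a `share` column, which is not in the original table, as one simple way to quantify the skew.

```r
# Toy token vector standing in for the tokenized corpus (assumption).
tokens <- c("the", "to", "the", "i", "a", "the", "to", "you")

# Count each distinct word and sort from most to least frequent.
freq <- sort(table(tokens), decreasing = TRUE)

word_freq <- data.frame(
  ngram = names(freq),
  freq  = as.integer(freq),
  stringsAsFactors = FALSE
)

# Share of all tokens taken by each word.
word_freq$share <- word_freq$freq / sum(word_freq$freq)

head(word_freq, 10)
```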

Words Frequency: First one in a line

Now let’s take a look at the frequency of the first word in each line.

Top 10 Words Frequency: First one in a line
first_word	freq
i	207633
thanks	52266
the	44893
you	40155
i’m	38130
rt	36509
just	29925
my	27231
if	26255
what	25669

We can see that the frequency distribution is highly skewed with a wide range: "i" (207,633) occurs about four times as often as the runner-up "thanks" (52,266).
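Counting first words only needs the leading token of each line. A minimal sketch, assuming the same simple cleaning rule as a stand-in for the report's tokenizer and using made-up sample lines:

```r
# Made-up sample lines (assumption).
sample_lines <- c("i love this", "Thanks for the follow", "I know", "RT this please")

# Lower-case and strip symbols and digits, then trim leading/trailing spaces.
clean <- trimws(gsub("[^a-z' ]", " ", tolower(sample_lines)))

# Keep everything before the first space, i.e. the first word of each line.
first_word <- sub(" .*$", "", clean)
first_word <- first_word[first_word != ""]  # skip lines with no words

first_freq <- sort(table(first_word), decreasing = TRUE)
print(first_freq)
```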

2-grams Frequency

Now let’s take a look at the frequency of 2-word phrases.

Top 10 2-grams Frequency
ngram	freq
in the	78457
for the	73969
of the	56960
on the	48542
to be	47097
to the	43440
thanks for	43007
at the	37245
i love	35925
going to	34277

From the histogram, we can see that the frequency distribution is very skewed with a wide range.
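An n-gram is a run of n consecutive tokens. As a sketch (the helper name `make_ngrams` is hypothetical, and building n-grams within one line only, never across line breaks, is an assumption), they can be generated with a sliding window:

```r
# Hypothetical helper: slide a window of size n over a token vector and
# paste each window into one space-separated phrase.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))  # line too short for an n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

tokens <- c("thanks", "for", "the", "follow")
make_ngrams(tokens, 2)
make_ngrams(tokens, 3)
```

Applying the function to every line and passing the combined result to `table()` would give a frequency ranking for 2-grams or 3-grams.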

3-grams Frequency

Now let’s take a look at the frequency of 3-word phrases.

Top 10 3-grams Frequency
ngram	freq
thanks for the	23621
looking forward to	8833
thank you for	8687
i love you	8423
for the follow	7932
going to be	7416
can’t wait to	7346
i want to	7118
a lot of	6250
to be a	5995

From the histogram, we can see that the frequency distribution is very skewed with a wide range: "thanks for the" (23,621) occurs more than twice as often as any other 3-gram in the top 10.

Next Steps

The next step is to build models that predict which word is going to be typed, so that the models can answer the questions below:
(1) In a blank input box, predict which words are most likely to be typed first, based on the frequency of the first word in a line;
(2) After the user has input one word, predict which word will be typed next, based on the frequency of the 2-grams;
(3) After the user has input two words, predict which word will be typed next, based on the frequency of the 3-grams.
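One simple way such a model could work is a lookup in the n-gram frequency tables. The sketch below is only an illustration of that idea, not the planned model: the function name `predict_next` is hypothetical, and the table holds just the ten 2-grams reported above, so most inputs return nothing.

```r
# The top-10 2-grams and their counts as reported in this document.
bigrams <- data.frame(
  ngram = c("in the", "for the", "of the", "on the", "to be",
            "to the", "thanks for", "at the", "i love", "going to"),
  freq  = c(78457, 73969, 56960, 48542, 47097,
            43440, 43007, 37245, 35925, 34277),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return up to k continuations of `prev`, ordered by
# how often the matching n-gram occurs.
predict_next <- function(prev, ngrams, k = 3) {
  prefix <- paste0(prev, " ")
  hits <- ngrams[startsWith(ngrams$ngram, prefix), ]
  hits <- hits[order(hits$freq, decreasing = TRUE), ]
  head(substring(hits$ngram, nchar(prefix) + 1), k)
}

predict_next("to", bigrams)
predict_next("thanks", bigrams)
```

The same function can be pointed at a 3-gram table by passing the last two typed words, separated by a space, as `prev`.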

DSC_Milestone_Report

Wei Chen

2025-08-22
