Load files

# read twitter txt
con <- file("./data/en_US/en_US.twitter.txt", "r") 
readLines(con, 5) 
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"
lines_twitter <- readLines(con)
close(con)
len_twitter <- length(lines_twitter)
str(lines_twitter) # 2,360,143
##  chr [1:2360143] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!" ...
# read news txt
con <- file("./data/en_US/en_US.news.txt", "r") 
readLines(con, 5) 
## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
lines_news <- readLines(con)
close(con)
len_news <- length(lines_news)
str(lines_news) # 1,010,237
##  chr [1:1010237] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to "| __truncated__ ...
# read blogs txt
con <- file("./data/en_US/en_US.blogs.txt", "r") 
readLines(con, 5) 
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
lines_blog <- readLines(con)
len_blog <- length(lines_blog)
str(lines_blog)
##  chr [1:899283] "If you have an alternative argument, let's hear it! :)" ...
# close connection
close(con)

Sample 10,000 records from each source

sample_size <- 10000
training_index <- sample(seq_len(len_twitter), size = sample_size)
sample_twitter <- lines_twitter[training_index]
write(sample_twitter, file = './data/en_US/en_US.twitter_sample.txt')

# sample_size <- floor(0.1 * len_news)
training_index <- sample(seq_len(len_news), size = sample_size)
sample_news <- lines_news[training_index]
write(sample_news, file = './data/en_US/en_US.news_sample.txt')

# sample_size <- floor(0.1 * len_blog)
training_index <- sample(seq_len(len_blog), size = sample_size)
sample_blog <- lines_blog[training_index]
write(sample_blog, file = './data/en_US/en_US.blog_sample.txt')

Text preprocessing

Define functions to clean text and count words

# function to load the sample file into a one line list
load_text <- function(path) {
    con <- file(path, 'r')
    text <- readLines(con)
    # paste all lines into one long string
    text <- paste(text, collapse = ' ')
    close(con)
    text
}
# function to clean text
clean_text <- function(dirty_text) {
    # remove everything except word characters, whitespace, and '.' (kept as a sentence marker for n-grams)
    clean_text <- gsub(x = dirty_text, pattern = '[^\\w\\s\\.]', replacement = "", perl=TRUE)
    # remove ellipses ('..', '...', etc.)
    clean_text <- gsub(x = clean_text, pattern = '[.]{2,}', replacement = "", perl=TRUE)
    # isolate sentence-ending periods (a period followed by whitespace and a capital letter) with surrounding spaces
    clean_text <- gsub(x = clean_text, pattern = '[.]\\ +(?=[A-Z])', replacement = " . ", perl=TRUE)
    # lower case
    clean_text <- tolower(clean_text)
    clean_text
}

# function to deliver unigram, bigram, and trigram
library(tm)
# unigram_count
unigram_count <- function(clean_text, remove_stopwords = FALSE) {
    if (remove_stopwords) {
        clean_text <- removeWords(clean_text, stopwords('english'))
    }
    # split sentences by whitespace
    words <- strsplit(clean_text, split = '\\s+')[[1]]
    # unigram
    freq.unigram <- sort(table(words), decreasing = TRUE)
    freq.unigram <- as.data.frame(freq.unigram)
    freq.unigram$name <- rownames(freq.unigram)
    freq.unigram
}
# bigram_count
bigram_count <- function(clean_text, remove_stopwords = FALSE) {
    if (remove_stopwords) {
        clean_text <- removeWords(clean_text, stopwords('english'))
    }
    # split sentences by whitespace
    words <- strsplit(clean_text, split = '\\s+')[[1]]
    # shift the word vector by one position and pad the end with '.'
    words2 <- c(words[-1], '.')
    pairs <- cbind(words, words2)
    # keep only pairs that do not span a '.' sentence boundary
    pairs <- subset(pairs, words != "." & words2 != ".")
    # paste each pair into a single string
    bigram <- paste(pairs[,1], pairs[,2], sep = " ")
    # bigram frequency
    freq.bigram <- sort(table(bigram), decreasing = TRUE)
    freq.bigram <- as.data.frame(freq.bigram)
    freq.bigram$name <- rownames(freq.bigram)
    freq.bigram
}
# trigram count
trigram_count <- function(clean_text, remove_stopwords = FALSE) {
    if (remove_stopwords) {
        clean_text <- removeWords(clean_text, stopwords('english'))
    }
    # split sentences by whitespace
    words <- strsplit(clean_text, split = '\\s+')[[1]]
    # shift the word vector by one and by two positions, padding the ends with '.'
    words2 <- c(words[-1], '.')
    words3 <- c(words2[-1], '.')
    pairs3 <- cbind(words, words2, words3)
    # keep only triples that do not span a '.' sentence boundary
    pairs3 <- subset(pairs3, words != "." & words2 != "." & words3 != ".")
    trigram <- paste(pairs3[,1], pairs3[,2], pairs3[,3], sep = " ")
    # trigram frequency
    freq.trigram <- sort(table(trigram), decreasing = TRUE)
    freq.trigram <- as.data.frame(freq.trigram)
    freq.trigram$name <- rownames(freq.trigram)
    freq.trigram
}

Preprocess sample files

# twitter part
twitter_path <- './data/en_US/en_US.twitter_sample.txt'
twitter_before <- load_text(twitter_path)
substr(twitter_before, 100, 1000)
## [1] " was all chalk and (yawn) boring back on Thursday? hey beautiful, what u up 2? looking great. keep up the good work! omg(: so excited!! Hope you do great(: hey hun.. I didnt gt it.. Its been a min u been m.i.a lol Karma is kickin me all up :/ OMG! Can't wait for the release of Photoshop CS6 Extended...the 3D effects are pretty sweet. had the best time at the tailgate..didn't even make it to the game! FAIL !! (Y) Oh, I'm liking the thirties so far. Just should have gone to bed sooner! Hey check us out man :) Your boy was in charge all day!!! #wheretheydothatat! I held it down!!! Chapter Chair of introduced , Director - #ebiznow Philadelphia I will fuck a bitch up over my mama fucking family or night I dnt give 2 fucks she had me nt u hoe lol, I'm gonna get on later(: then we can video chat nice! Will re-purpose immediately I think if I become a regular food network show judge where I get t"
twitter_after <- clean_text(twitter_before)
substr(twitter_after, 100, 1000)
## [1] "halk and yawn boring back on thursday hey beautiful what u up 2 looking great. keep up the good work omg so excited hope you do great hey hun i didnt gt it its been a min u been m.i.a lol karma is kickin me all up  omg cant wait for the release of photoshop cs6 extendedthe 3d effects are pretty sweet. had the best time at the tailgatedidnt even make it to the game fail  y oh im liking the thirties so far . just should have gone to bed sooner hey check us out man  your boy was in charge all day wheretheydothatat i held it down chapter chair of introduced  director  ebiznow philadelphia i will fuck a bitch up over my mama fucking family or night i dnt give 2 fucks she had me nt u hoe lol im gonna get on later then we can video chat nice will repurpose immediately i think if i become a regular food network show judge where i get to eat all the time id be happy. very beautiful emma . cant wai"
# news part
news_path <- './data/en_US/en_US.news_sample.txt'
news_before <- load_text(news_path)
substr(news_before, 100, 1000)
## [1] " producer on the HBO series set to debut April 22. Rich is also, of course, one of the most influential cultural critics of the era for his work as a Sunday columnist for The New York Times. He is leaning forward in his chair, hanging on her every word. Aimee Nassif, Chesterfield's planning and development director, noted that the city is still working with the Taubman project on its improvement plans and that it still has to submit an application for building permits. Chen, 40, spent most of the last seven years in prison or under house arrest in what was seen as retribution by local authorities for his activism against forced abortions and other official misdeeds. His wife, daughter and mother were confined at home with him, enduring beatings, searches and other mistreatment. Each time Hinkle has offered a bill to such an end, he says, it has failed to even get a hearing. Some brides ch"
news_after <- clean_text(news_before)
substr(news_after, 100, 1000)
## [1] "roducer on the hbo series set to debut april 22 . rich is also of course one of the most influential cultural critics of the era for his work as a sunday columnist for the new york times . he is leaning forward in his chair hanging on her every word . aimee nassif chesterfields planning and development director noted that the city is still working with the taubman project on its improvement plans and that it still has to submit an application for building permits . chen 40 spent most of the last seven years in prison or under house arrest in what was seen as retribution by local authorities for his activism against forced abortions and other official misdeeds . his wife daughter and mother were confined at home with him enduring beatings searches and other mistreatment . each time hinkle has offered a bill to such an end he says it has failed to even get a hearing . some brides choose an "
# blog part
blog_path <- './data/en_US/en_US.blog_sample.txt'
blog_before <- load_text(blog_path)
substr(blog_before, 100, 1000)
## [1] " buy plants that are perfect for your climate. They even offer you an option to buy seeds that you can use where you live and design your garden to make the most out of your available space. It also seems that for those on business programming most reported economic data can also be spun in a positive way and that if there are two conflicting reports, the one showing an improving economy will be highlighted and promoted. 5. This is the soft sell. Also returning will be Chrissie Tobas and Theresa Gerber. They will continue as Guest Designers for another month. Isn't this fabulous news? Congratulations to both of you. We are all so happy you decided to give us the pleasure of a little more time working with you. Even though it was a run, we walked, it was a lot better that way (more color). It was fun to see the different costumes and all of the other people covered in color. My favorite pa"
blog_after <- clean_text(blog_before)
substr(blog_after, 100, 1000)
## [1] " buy plants that are perfect for your climate . they even offer you an option to buy seeds that you can use where you live and design your garden to make the most out of your available space . it also seems that for those on business programming most reported economic data can also be spun in a positive way and that if there are two conflicting reports the one showing an improving economy will be highlighted and promoted. 5 . this is the soft sell . also returning will be chrissie tobas and theresa gerber . they will continue as guest designers for another month . isnt this fabulous news congratulations to both of you . we are all so happy you decided to give us the pleasure of a little more time working with you . even though it was a run we walked it was a lot better that way more color . it was fun to see the different costumes and all of the other people covered in color . my favorite"

N-gram counts

# twitter part
twitter_unigram <- unigram_count(twitter_after)
head(twitter_unigram, 20)
##      freq.unigram name
## .            5132    .
## the          3881  the
## to           3306   to
## i            3077    i
## a            2611    a
## you          2255  you
## and          1886  and
## in           1659   in
## for          1607  for
## is           1568   is
## of           1429   of
## my           1240   my
## it           1223   it
## on           1178   on
## that          942 that
## me            863   me
## be            807   be
## at            752   at
## have          750 have
## with          717 with
twitter_bigram <- bigram_count(twitter_after)
head(twitter_bigram, 20)
##            freq.bigram       name
## in the             338     in the
## for the            324    for the
## of the             231     of the
## on the             209     on the
## to be              199      to be
## thanks for         194 thanks for
## to the             188     to the
## have a             153     have a
## at the             151     at the
## to see             142     to see
## i love             139     i love
## to get             133     to get
## i have             130     i have
## going to           127   going to
## if you             125     if you
## is a               117       is a
## will be            117    will be
## i am               115       i am
## for a              114      for a
## i was              113      i was
twitter_trigram <- trigram_count(twitter_after)
head(twitter_trigram, 20)
##                    freq.trigram               name
## thanks for the              107     thanks for the
## me me me                     37           me me me
## going to be                  36        going to be
## i love you                   34         i love you
## cant wait to                 33       cant wait to
## for the follow               33     for the follow
## looking forward to           32 looking forward to
## have a great                 28       have a great
## to see you                   28         to see you
## i need to                    25          i need to
## thank you for                25      thank you for
## cant wait for                24      cant wait for
## for the rt                   23         for the rt
## i want to                    23          i want to
## to be a                      23            to be a
## a lot of                     22           a lot of
## is going to                  21        is going to
## let me know                  21        let me know
## one of the                   21         one of the
## i cant wait                  20        i cant wait
# news part
news_unigram <- unigram_count(news_after)
head(news_unigram, 20)
##      freq.unigram name
## the         19600  the
## .           18168    .
## to           9029   to
## and          8915  and
## a            8708    a
## of           7701   of
## in           6686   in
## for          3482  for
## that         3385 that
## is           2775   is
## on           2670   on
## with         2500 with
## said         2448 said
## he           2333   he
## was          2311  was
## it           2253   it
## at           2140   at
## as           1874   as
## i            1669    i
## his          1641  his
news_bigram <- bigram_count(news_after)
head(news_bigram, 20)
##          freq.bigram     name
## of the          1876   of the
## in the          1772   in the
## to the           868   to the
## on the           756   on the
## for the          702  for the
## at the           604   at the
## and the          535  and the
## in a             520     in a
## to be            503    to be
## with the         412 with the
## from the         394 from the
## with a           363   with a
## he said          343  he said
## as a             333     as a
## for a            310    for a
## of a             297     of a
## is a             291     is a
## it was           291   it was
## that the         285 that the
## and a            273    and a
news_trigram <- trigram_count(news_after)
head(news_trigram, 20)
##                   freq.trigram              name
## a lot of                   141          a lot of
## one of the                 141        one of the
## to be a                     69           to be a
## according to the            64  according to the
## part of the                 64       part of the
## going to be                 61       going to be
## the end of                  55        the end of
## out of the                  53        out of the
## in the first                51      in the first
## the united states           50 the united states
## as well as                  49        as well as
## some of the                 46       some of the
## the first time              42    the first time
## the university of           41 the university of
## end of the                  37        end of the
## it was a                    37          it was a
## is going to                 35       is going to
## most of the                 35       most of the
## said in a                   34         said in a
## at the time                 33       at the time
# blog part
blog_unigram <- unigram_count(blog_after)
head(blog_unigram, 20)
##      freq.unigram name
## the         21037  the
## .           19307    .
## and         12046  and
## to          12042   to
## a           10012    a
## of           9978   of
## i            8412    i
## in           6440   in
## that         5215 that
## is           4877   is
## it           4484   it
## for          4049  for
## you          3288  you
## with         3209 with
## on           3078   on
## was          3040  was
## my           2875   my
## this         2853 this
## as           2487   as
## have         2382 have
blog_bigram <- bigram_count(blog_after)
head(blog_bigram, 20)
##          freq.bigram     name
## of the          2180   of the
## in the          1638   in the
## to the           976   to the
## on the           896   on the
## to be            743    to be
## and the          670  and the
## for the          669  for the
## and i            563    and i
## at the           551   at the
## it was           535   it was
## it is            527    it is
## with the         522 with the
## in a             516     in a
## i was            513    i was
## i am             504     i am
## is a             490     is a
## i have           486   i have
## from the         415 from the
## of a             405     of a
## that i           391   that i
blog_trigram <- trigram_count(blog_after)
head(blog_trigram, 20)
##               freq.trigram          name
## one of the             160    one of the
## a lot of               116      a lot of
## as well as              80    as well as
## it was a                80      it was a
## some of the             76   some of the
## i dont know             72   i dont know
## the end of              70    the end of
## a couple of             69   a couple of
## the fact that           68 the fact that
## this is the             68   this is the
## be able to              64    be able to
## i have a                63      i have a
## the rest of             62   the rest of
## to be a                 60       to be a
## one of my               59     one of my
## out of the              57    out of the
## part of the             57   part of the
## i had to                56      i had to
## it is a                 55       it is a
## this is a               55     this is a

Corpus exploration

library(ggplot2)
# twitter ngrams plotting
ggplot(twitter_unigram[1:20,], aes(x = reorder(name, -freq.unigram), y = freq.unigram)) + 
    geom_bar(stat = "identity") + 
    ggtitle("Twitter: Top 20 Unigram Frequency") +
    xlab("unigram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(twitter_bigram[1:20,], aes(x = reorder(name, -freq.bigram), y = freq.bigram)) + 
    geom_bar(stat = "identity") + 
    ggtitle("Twitter: Top 20 Bigram Frequency") +
    xlab("bigram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(twitter_trigram[1:20,], aes(x = reorder(name, -freq.trigram), y = freq.trigram)) + 
    geom_bar(stat = "identity") + 
    ggtitle("Twitter: Top 20 Trigram Frequency") +
    xlab("trigram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

# news ngrams plotting
ggplot(news_unigram[1:20,], aes(x = reorder(name, -freq.unigram), y = freq.unigram)) + 
    geom_bar(stat = "identity") + 
    ggtitle("News: Top 20 Unigram Frequency") +
    xlab("unigram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(news_bigram[1:20,], aes(x = reorder(name, -freq.bigram), y = freq.bigram)) + 
    geom_bar(stat = "identity") + 
    ggtitle("News: Top 20 Bigram Frequency") +
    xlab("bigram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(news_trigram[1:20,], aes(x = reorder(name, -freq.trigram), y = freq.trigram)) + 
    geom_bar(stat = "identity") + 
    ggtitle("News: Top 20 Trigram Frequency") +
    xlab("trigram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

# blog ngrams plotting
ggplot(blog_unigram[1:20,], aes(x = reorder(name, -freq.unigram), y = freq.unigram)) + 
    geom_bar(stat = "identity") + 
    ggtitle("Blog: Top 20 Unigram Frequency") +
    xlab("unigram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(blog_bigram[1:20,], aes(x = reorder(name, -freq.bigram), y = freq.bigram)) + 
    geom_bar(stat = "identity") + 
    ggtitle("Blog: Top 20 Bigram Frequency") +
    xlab("bigram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(blog_trigram[1:20,], aes(x = reorder(name, -freq.trigram), y = freq.trigram)) + 
    geom_bar(stat = "identity") + 
    ggtitle("Blog: Top 20 Trigram Frequency") +
    xlab("trigram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

What are the frequencies of unigrams, bigrams, and trigrams in the dataset?

  • The unigram plots show that stopwords have the highest frequencies.
  • Frequencies decrease on average as the number of grams grows, which is expected.

How many unique words are needed to cover 50% or 90% of all word instances in the language?

# drop the '.' sentence marker (the first row of the unigram table)
twitter_instance <- twitter_unigram[-1,]$freq.unigram
# function to find how many of the most frequent words are needed to cover
# a given share of all word instances
unique_cover <- function(instance, percent) {
    coverage <- cumsum(instance) / sum(instance)
    # index of the first word at which cumulative coverage reaches the target
    which(coverage >= percent)[1]
}

unique_cover(twitter_instance, 0.5)
## [1] 128
unique_cover(twitter_instance, 0.9)
## [1] 5371

Detect foreign languages

library("textcat")
## Warning: package 'textcat' was built under R version 3.2.3
con <- file(twitter_path, 'r')
twitter_text <- readLines(con)
close(con)
twitter_lan <- lapply(twitter_text, textcat)
table((twitter_lan == 'english'))
## 
## FALSE  TRUE 
##  4040  5957
cbind(twitter_text[1:10], twitter_lan[1:10])
##       [,1]                                                                                          
##  [1,] "うん!面白かった!Very good English! I loved the Metal Gear ringtone."                       
##  [2,] "Remember when the this #NCAATournament was all chalk and (yawn) boring back on Thursday?"    
##  [3,] "hey beautiful, what u up 2? looking great. keep up the good work!"                           
##  [4,] "omg(: so excited!! Hope you do great(:"                                                      
##  [5,] "hey hun.. I didnt gt it.. Its been a min u been m.i.a lol"                                   
##  [6,] "Karma is kickin me all up :/"                                                                
##  [7,] "OMG! Can't wait for the release of Photoshop CS6 Extended...the 3D effects are pretty sweet."
##  [8,] "had the best time at the tailgate..didn't even make it to the game! FAIL !! (Y)"             
##  [9,] "Oh, I'm liking the thirties so far. Just should have gone to bed sooner!"                    
## [10,] "Hey check us out man :)"                                                                     
##       [,2]            
##  [1,] "scots"         
##  [2,] "scots"         
##  [3,] "scots"         
##  [4,] "middle_frisian"
##  [5,] "middle_frisian"
##  [6,] "middle_frisian"
##  [7,] "english"       
##  [8,] "scots"         
##  [9,] "english"       
## [10,] "breton"

I use the 'textcat' package to detect foreign languages. The results are not good: many English sentences are interpreted as non-English. Unicode ranges and a language dictionary could be combined to detect languages more reliably; Unicode ranges would catch non-Latin scripts, while a dictionary would handle languages that share the Latin alphabet.
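
A rough, hedged sketch of that idea (not the textcat approach above): flag a line as likely English when it contains only ASCII characters and most of its tokens appear in a reference vocabulary. Here english_vocab is a placeholder built from the top unigrams computed earlier; a real dictionary would be far larger.

# placeholder vocabulary: the most frequent unigrams from the sampled corpus
english_vocab <- as.character(head(twitter_unigram$name, 1000))

is_probably_english <- function(line, vocab = english_vocab, threshold = 0.5) {
    # lines containing non-ASCII characters (e.g. CJK scripts) are flagged immediately
    if (grepl("[^\\x01-\\x7F]", line, perl = TRUE)) return(FALSE)
    tokens <- strsplit(gsub("[^a-z' ]", " ", tolower(line)), "\\s+")[[1]]
    tokens <- tokens[tokens != ""]
    if (length(tokens) == 0) return(NA)
    # share of tokens found in the reference vocabulary
    mean(tokens %in% vocab) >= threshold
}

table(sapply(twitter_text[1:100], is_probably_english))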

How can we increase the coverage, either by identifying words that may not be in the corpora or by using a smaller number of words in the dictionary to cover the same number of phrases?

  • Build a synonym dictionary to reduce the number of unique words, so that fewer words represent more instances (see the sketch below).
  • Use n-gram prediction to cover extra words that do not appear among the highest-frequency unique words.
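
A minimal sketch of the synonym idea, assuming a tiny hand-made mapping (a real synonym dictionary, e.g. built from WordNet, would be much larger): words are mapped to a canonical form before counting, so several surface forms share one entry in the unigram table.

# toy mapping from surface forms to a canonical word; illustration only
synonyms <- c(gorgeous = "beautiful", lovely = "beautiful",
              awesome = "great", fantastic = "great")

canonicalize <- function(words, mapping = synonyms) {
    mapped <- unname(mapping[words])
    # words without a mapping are kept unchanged
    ifelse(is.na(mapped), words, mapped)
}

canonicalize(c("gorgeous", "day", "awesome"))  # "beautiful" "day" "great"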

Future plan

The unigram, bigram, and trigram data frames are already stored from the previous work. Based on those counts, I can calculate the conditional probability of a word by dividing the count of an n-gram by the count of its (n-1)-gram prefix. Following the Markov chain assumption, the probability of a phrase is the product of these conditional probabilities, and the word with the highest conditional probability given the preceding words becomes the final prediction.
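
A minimal sketch of that calculation, assuming the twitter_bigram and twitter_trigram data frames built above: P(w3 | w1 w2) is estimated as count(w1 w2 w3) / count(w1 w2), and the most probable next word for a two-word prefix is returned.

predict_from_trigrams <- function(prefix, trigrams = twitter_trigram,
                                  bigrams = twitter_bigram) {
    # count of the two-word prefix
    prefix_count <- bigrams$freq.bigram[bigrams$name == prefix]
    if (length(prefix_count) == 0) return(NA)
    # trigrams whose first two words match the prefix
    candidates <- trigrams[grepl(paste0("^", prefix, " "), trigrams$name), ]
    if (nrow(candidates) == 0) return(NA)
    # conditional probability: count(w1 w2 w3) / count(w1 w2)
    probs <- candidates$freq.trigram / prefix_count
    # return the last word of the most probable matching trigram
    tail(strsplit(as.character(candidates$name[which.max(probs)]), " ")[[1]], 1)
}

predict_from_trigrams("thanks for")  # "the", given the counts above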

To make the model smaller and more efficient, it is necessary to reduce the number of unique word pairs. One way is to use synonyms. The other is to annotate the words with POS tags, so that many unique word pairs collapse into a much smaller set of tag pairs, as illustrated below.
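
Illustration only, using a tiny hand-made tag lookup (a real implementation would use a POS tagger, for example from the openNLP package): distinct word pairs collapse into a single tag pair.

# toy word-to-tag lookup; a real tagger would assign tags in context
toy_tags <- c(the = "DET", a = "DET", cat = "NOUN", dog = "NOUN")

word_pairs <- c("the cat", "a dog", "the dog")
tag_pairs <- sapply(strsplit(word_pairs, " "),
                    function(p) paste(toy_tags[p], collapse = " "))
table(tag_pairs)  # all three word pairs map to the single tag pair "DET NOUN"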

I will start with n = 3 (trigrams). If the performance is not good, I may increase n.

Add-one (Laplace) smoothing is the simplest approach: add one to the count of every word, including unknown words, so that no word receives zero probability.
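
A minimal sketch of add-one smoothing for unigrams, using the twitter counts built above: with N total word instances and a vocabulary of size V, a word seen count times gets probability (count + 1) / (N + V), and an unseen word gets 1 / (N + V).

laplace_prob <- function(count, N, V) {
    (count + 1) / (N + V)
}

N <- sum(twitter_unigram$freq.unigram)  # total word instances
V <- nrow(twitter_unigram)              # vocabulary size
laplace_prob(0, N, V)                   # probability assigned to an unseen word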

Cross-validation: split the dataset into training and validation sets, and use the validation set to evaluate model performance.
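
A simple hold-out split sketch on the sampled twitter lines (80% training, 20% validation); the seed is arbitrary.

set.seed(42)
n <- length(sample_twitter)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_lines <- sample_twitter[train_idx]
valid_lines <- sample_twitter[-train_idx]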

If the n-gram model produces no prediction, we back off to the (n-1)-gram model, and the same applies at each level. Here, if the trigram prediction produces nothing, the bigram model takes over, and then the unigram model.
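
A backoff sketch, assuming the predict_from_trigrams() function sketched above and a hypothetical predict_from_bigrams() fallback with the same interface.

predict_next <- function(prefix) {
    pred <- predict_from_trigrams(prefix)
    if (!is.na(pred)) return(pred)
    # back off to the bigram model using only the last word of the prefix
    last_word <- tail(strsplit(prefix, " ")[[1]], 1)
    pred <- predict_from_bigrams(last_word)  # hypothetical bigram fallback
    if (!is.na(pred)) return(pred)
    # final fallback: the most frequent unigram (skipping the '.' marker)
    as.character(twitter_unigram$name[2])
}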

Since this project is about typing prediction, the Shiny app should accept text input and output the predicted next word. A fancier version would be dynamic, like typing on an iPhone, with the prediction popping up instantly as the user types.
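
A minimal Shiny sketch of that interface, assuming a predict_next() function like the backoff sketch above.

library(shiny)

ui <- fluidPage(
    textInput("text", "Type a phrase:"),
    textOutput("prediction")
)

server <- function(input, output) {
    output$prediction <- renderText({
        if (nchar(input$text) == 0) return("")
        predict_next(tolower(input$text))
    })
}

shinyApp(ui = ui, server = server)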