# read twitter txt
con <- file("./data/en_US/en_US.twitter.txt", "r")
readLines(con, 5)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
lines_twitter <- readLines(con)
len_twitter <- length(lines_twitter)
str(lines_twitter) # 2,360,143
## chr [1:2360143] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!" ...
close(con)
# read news txt
con <- file("./data/en_US/en_US.news.txt", "r")
readLines(con, 5)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
lines_news <- readLines(con)
len_news <- length(lines_news)
str(lines_news) # 1,010,237
## chr [1:1010237] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to "| __truncated__ ...
close(con)
# read blogs txt
con <- file("./data/en_US/en_US.blogs.txt", "r")
readLines(con, 5)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
lines_blog <- readLines(con)
len_blog <- length(lines_blog)
str(lines_blog)
## chr [1:899283] "If you have an alternative argument, let's hear it! :)" ...
# close connection
close(con)
sample_size <- 10000
training_index <- sample(seq_len(len_twitter), size = sample_size)
sample_twitter <- lines_twitter[training_index]
write(sample_twitter, file = './data/en_US/en_US.twitter_sample.txt')
# sample_size <- floor(0.1 * len_news)
training_index <- sample(seq_len(len_news), size = sample_size)
sample_news <- lines_news[training_index]
write(sample_news, file = './data/en_US/en_US.news_sample.txt')
# sample_size <- floor(0.1 * len_blog)
training_index <- sample(seq_len(len_blog), size = sample_size)
sample_blog <- lines_blog[training_index]
write(sample_blog, file = './data/en_US/en_US.blog_sample.txt')
# function to load a sample file into a single string
load_text <- function(path) {
  con <- file(path, 'r')
  text <- readLines(con)
  # paste all lines into one long string
  text <- paste(text, collapse = ' ')
  close(con)
  text
}
# function to clean text
clean_text <- function(dirty_text) {
  # remove non-word, non-whitespace characters, but keep '.' as a sentence boundary for ngrams
  clean_text <- gsub(x = dirty_text, pattern = '[^\\w\\s\\.]', replacement = "", perl = TRUE)
  # remove runs of two or more periods (ellipses '...')
  clean_text <- gsub(x = clean_text, pattern = '[.]{2,}', replacement = "", perl = TRUE)
  # separate sentence-ending periods from the preceding word with spaces
  clean_text <- gsub(x = clean_text, pattern = '[.]\\ +(?=[A-Z])', replacement = " . ", perl = TRUE)
  # lower case
  clean_text <- tolower(clean_text)
  clean_text
}
# functions to compute unigram, bigram, and trigram frequencies
library(tm)
# unigram count
unigram_count <- function(clean_text, stopwords = FALSE) {
  if (stopwords) {
    clean_text <- removeWords(clean_text, stopwords('english'))
  }
  # split the text into words on whitespace
  words <- strsplit(clean_text, split = '\\s+')[[1]]
  # unigram frequency
  freq.unigram <- sort(table(words), decreasing = TRUE)
  freq.unigram <- as.data.frame(freq.unigram)
  freq.unigram$name <- rownames(freq.unigram)
  freq.unigram
}
# bigram count
bigram_count <- function(clean_text, stopwords = FALSE) {
  if (stopwords) {
    clean_text <- removeWords(clean_text, stopwords('english'))
  }
  # split the text into words on whitespace
  words <- strsplit(clean_text, split = '\\s+')[[1]]
  # shift the word vector by one position, padding the end with '.'
  words2 <- c(words[-1], '.')
  pairs <- cbind(words, words2)
  # keep only pairs that do not span a sentence boundary '.'
  pairs <- subset(pairs, words != "." & words2 != ".")
  # paste each pair into a single string
  bigram <- paste(pairs[, 1], pairs[, 2], sep = " ")
  # bigram frequency
  freq.bigram <- sort(table(bigram), decreasing = TRUE)
  freq.bigram <- as.data.frame(freq.bigram)
  freq.bigram$name <- rownames(freq.bigram)
  freq.bigram
}
# trigram count
trigram_count <- function(clean_text, stopwords = FALSE) {
  if (stopwords) {
    clean_text <- removeWords(clean_text, stopwords('english'))
  }
  # split the text into words on whitespace
  words <- strsplit(clean_text, split = '\\s+')[[1]]
  # shift the word vector by one and two positions, padding the ends with '.'
  words2 <- c(words[-1], '.')
  words3 <- c(words2[-1], '.')
  pairs3 <- cbind(words, words2, words3)
  # keep only triples that do not span a sentence boundary '.'
  pairs3 <- subset(pairs3, words != "." & words2 != "." & words3 != ".")
  trigram <- paste(pairs3[, 1], pairs3[, 2], pairs3[, 3], sep = " ")
  # trigram frequency
  freq.trigram <- sort(table(trigram), decreasing = TRUE)
  freq.trigram <- as.data.frame(freq.trigram)
  freq.trigram$name <- rownames(freq.trigram)
  freq.trigram
}
# twitter part
twitter_path <- './data/en_US/en_US.twitter_sample.txt'
twitter_before <- load_text(twitter_path)
substr(twitter_before, 100, 1000)
## [1] " was all chalk and (yawn) boring back on Thursday? hey beautiful, what u up 2? looking great. keep up the good work! omg(: so excited!! Hope you do great(: hey hun.. I didnt gt it.. Its been a min u been m.i.a lol Karma is kickin me all up :/ OMG! Can't wait for the release of Photoshop CS6 Extended...the 3D effects are pretty sweet. had the best time at the tailgate..didn't even make it to the game! FAIL !! (Y) Oh, I'm liking the thirties so far. Just should have gone to bed sooner! Hey check us out man :) Your boy was in charge all day!!! #wheretheydothatat! I held it down!!! Chapter Chair of introduced , Director - #ebiznow Philadelphia I will fuck a bitch up over my mama fucking family or night I dnt give 2 fucks she had me nt u hoe lol, I'm gonna get on later(: then we can video chat nice! Will re-purpose immediately I think if I become a regular food network show judge where I get t"
twitter_after <- clean_text(twitter_before)
substr(twitter_after, 100, 1000)
## [1] "halk and yawn boring back on thursday hey beautiful what u up 2 looking great. keep up the good work omg so excited hope you do great hey hun i didnt gt it its been a min u been m.i.a lol karma is kickin me all up omg cant wait for the release of photoshop cs6 extendedthe 3d effects are pretty sweet. had the best time at the tailgatedidnt even make it to the game fail y oh im liking the thirties so far . just should have gone to bed sooner hey check us out man your boy was in charge all day wheretheydothatat i held it down chapter chair of introduced director ebiznow philadelphia i will fuck a bitch up over my mama fucking family or night i dnt give 2 fucks she had me nt u hoe lol im gonna get on later then we can video chat nice will repurpose immediately i think if i become a regular food network show judge where i get to eat all the time id be happy. very beautiful emma . cant wai"
# news part
news_path <- './data/en_US/en_US.news_sample.txt'
news_before <- load_text(news_path)
substr(news_before, 100, 1000)
## [1] " producer on the HBO series set to debut April 22. Rich is also, of course, one of the most influential cultural critics of the era for his work as a Sunday columnist for The New York Times. He is leaning forward in his chair, hanging on her every word. Aimee Nassif, Chesterfield's planning and development director, noted that the city is still working with the Taubman project on its improvement plans and that it still has to submit an application for building permits. Chen, 40, spent most of the last seven years in prison or under house arrest in what was seen as retribution by local authorities for his activism against forced abortions and other official misdeeds. His wife, daughter and mother were confined at home with him, enduring beatings, searches and other mistreatment. Each time Hinkle has offered a bill to such an end, he says, it has failed to even get a hearing. Some brides ch"
news_after <- clean_text(news_before)
substr(news_after, 100, 1000)
## [1] "roducer on the hbo series set to debut april 22 . rich is also of course one of the most influential cultural critics of the era for his work as a sunday columnist for the new york times . he is leaning forward in his chair hanging on her every word . aimee nassif chesterfields planning and development director noted that the city is still working with the taubman project on its improvement plans and that it still has to submit an application for building permits . chen 40 spent most of the last seven years in prison or under house arrest in what was seen as retribution by local authorities for his activism against forced abortions and other official misdeeds . his wife daughter and mother were confined at home with him enduring beatings searches and other mistreatment . each time hinkle has offered a bill to such an end he says it has failed to even get a hearing . some brides choose an "
# blog part
blog_path <- './data/en_US/en_US.blog_sample.txt'
blog_before <- load_text(blog_path)
substr(blog_before, 100, 1000)
## [1] " buy plants that are perfect for your climate. They even offer you an option to buy seeds that you can use where you live and design your garden to make the most out of your available space. It also seems that for those on business programming most reported economic data can also be spun in a positive way and that if there are two conflicting reports, the one showing an improving economy will be highlighted and promoted. 5. This is the soft sell. Also returning will be Chrissie Tobas and Theresa Gerber. They will continue as Guest Designers for another month. Isn't this fabulous news? Congratulations to both of you. We are all so happy you decided to give us the pleasure of a little more time working with you. Even though it was a run, we walked, it was a lot better that way (more color). It was fun to see the different costumes and all of the other people covered in color. My favorite pa"
blog_after <- clean_text(blog_before)
substr(blog_after, 100, 1000)
## [1] " buy plants that are perfect for your climate . they even offer you an option to buy seeds that you can use where you live and design your garden to make the most out of your available space . it also seems that for those on business programming most reported economic data can also be spun in a positive way and that if there are two conflicting reports the one showing an improving economy will be highlighted and promoted. 5 . this is the soft sell . also returning will be chrissie tobas and theresa gerber . they will continue as guest designers for another month . isnt this fabulous news congratulations to both of you . we are all so happy you decided to give us the pleasure of a little more time working with you . even though it was a run we walked it was a lot better that way more color . it was fun to see the different costumes and all of the other people covered in color . my favorite"
# twitter part
twitter_unigram <- unigram_count(twitter_after)
head(twitter_unigram, 20)
## freq.unigram name
## . 5132 .
## the 3881 the
## to 3306 to
## i 3077 i
## a 2611 a
## you 2255 you
## and 1886 and
## in 1659 in
## for 1607 for
## is 1568 is
## of 1429 of
## my 1240 my
## it 1223 it
## on 1178 on
## that 942 that
## me 863 me
## be 807 be
## at 752 at
## have 750 have
## with 717 with
twitter_bigram <- bigram_count(twitter_after)
head(twitter_bigram, 20)
## freq.bigram name
## in the 338 in the
## for the 324 for the
## of the 231 of the
## on the 209 on the
## to be 199 to be
## thanks for 194 thanks for
## to the 188 to the
## have a 153 have a
## at the 151 at the
## to see 142 to see
## i love 139 i love
## to get 133 to get
## i have 130 i have
## going to 127 going to
## if you 125 if you
## is a 117 is a
## will be 117 will be
## i am 115 i am
## for a 114 for a
## i was 113 i was
twitter_trigram <- trigram_count(twitter_after)
head(twitter_trigram, 20)
## freq.trigram name
## thanks for the 107 thanks for the
## me me me 37 me me me
## going to be 36 going to be
## i love you 34 i love you
## cant wait to 33 cant wait to
## for the follow 33 for the follow
## looking forward to 32 looking forward to
## have a great 28 have a great
## to see you 28 to see you
## i need to 25 i need to
## thank you for 25 thank you for
## cant wait for 24 cant wait for
## for the rt 23 for the rt
## i want to 23 i want to
## to be a 23 to be a
## a lot of 22 a lot of
## is going to 21 is going to
## let me know 21 let me know
## one of the 21 one of the
## i cant wait 20 i cant wait
# news part
news_unigram <- unigram_count(news_after)
head(news_unigram, 20)
## freq.unigram name
## the 19600 the
## . 18168 .
## to 9029 to
## and 8915 and
## a 8708 a
## of 7701 of
## in 6686 in
## for 3482 for
## that 3385 that
## is 2775 is
## on 2670 on
## with 2500 with
## said 2448 said
## he 2333 he
## was 2311 was
## it 2253 it
## at 2140 at
## as 1874 as
## i 1669 i
## his 1641 his
news_bigram <- bigram_count(news_after)
head(news_bigram, 20)
## freq.bigram name
## of the 1876 of the
## in the 1772 in the
## to the 868 to the
## on the 756 on the
## for the 702 for the
## at the 604 at the
## and the 535 and the
## in a 520 in a
## to be 503 to be
## with the 412 with the
## from the 394 from the
## with a 363 with a
## he said 343 he said
## as a 333 as a
## for a 310 for a
## of a 297 of a
## is a 291 is a
## it was 291 it was
## that the 285 that the
## and a 273 and a
news_trigram <- trigram_count(news_after)
head(news_trigram, 20)
## freq.trigram name
## a lot of 141 a lot of
## one of the 141 one of the
## to be a 69 to be a
## according to the 64 according to the
## part of the 64 part of the
## going to be 61 going to be
## the end of 55 the end of
## out of the 53 out of the
## in the first 51 in the first
## the united states 50 the united states
## as well as 49 as well as
## some of the 46 some of the
## the first time 42 the first time
## the university of 41 the university of
## end of the 37 end of the
## it was a 37 it was a
## is going to 35 is going to
## most of the 35 most of the
## said in a 34 said in a
## at the time 33 at the time
# blog part
blog_unigram <- unigram_count(blog_after)
head(blog_unigram, 20)
## freq.unigram name
## the 21037 the
## . 19307 .
## and 12046 and
## to 12042 to
## a 10012 a
## of 9978 of
## i 8412 i
## in 6440 in
## that 5215 that
## is 4877 is
## it 4484 it
## for 4049 for
## you 3288 you
## with 3209 with
## on 3078 on
## was 3040 was
## my 2875 my
## this 2853 this
## as 2487 as
## have 2382 have
blog_bigram <- bigram_count(blog_after)
head(blog_bigram, 20)
## freq.bigram name
## of the 2180 of the
## in the 1638 in the
## to the 976 to the
## on the 896 on the
## to be 743 to be
## and the 670 and the
## for the 669 for the
## and i 563 and i
## at the 551 at the
## it was 535 it was
## it is 527 it is
## with the 522 with the
## in a 516 in a
## i was 513 i was
## i am 504 i am
## is a 490 is a
## i have 486 i have
## from the 415 from the
## of a 405 of a
## that i 391 that i
blog_trigram <- trigram_count(blog_after)
head(blog_trigram, 20)
## freq.trigram name
## one of the 160 one of the
## a lot of 116 a lot of
## as well as 80 as well as
## it was a 80 it was a
## some of the 76 some of the
## i dont know 72 i dont know
## the end of 70 the end of
## a couple of 69 a couple of
## the fact that 68 the fact that
## this is the 68 this is the
## be able to 64 be able to
## i have a 63 i have a
## the rest of 62 the rest of
## to be a 60 to be a
## one of my 59 one of my
## out of the 57 out of the
## part of the 57 part of the
## i had to 56 i had to
## it is a 55 it is a
## this is a 55 this is a
library(ggplot2)
# twitter ngrams plotting
ggplot(twitter_unigram[1:20, ], aes(x = name, y = freq.unigram)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Unigram Frequency") +
  theme(axis.text.x = element_text(angle = 45))
ggplot(twitter_bigram[1:20, ], aes(x = name, y = freq.bigram)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Bigram Frequency") +
  theme(axis.text.x = element_text(angle = 45))
ggplot(twitter_trigram[1:20, ], aes(x = name, y = freq.trigram)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Trigram Frequency") +
  theme(axis.text.x = element_text(angle = 45))
# news ngrams plotting
ggplot(news_unigram[1:20, ], aes(x = name, y = freq.unigram)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Unigram Frequency") +
  theme(axis.text.x = element_text(angle = 45))
ggplot(news_bigram[1:20, ], aes(x = name, y = freq.bigram)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Bigram Frequency") +
  theme(axis.text.x = element_text(angle = 45))
ggplot(news_trigram[1:20, ], aes(x = name, y = freq.trigram)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Trigram Frequency") +
  theme(axis.text.x = element_text(angle = 45))
# blog ngrams plotting
ggplot(blog_unigram[1:20, ], aes(x = name, y = freq.unigram)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Unigram Frequency") +
  theme(axis.text.x = element_text(angle = 45))
ggplot(blog_bigram[1:20, ], aes(x = name, y = freq.bigram)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Bigram Frequency") +
  theme(axis.text.x = element_text(angle = 45))
ggplot(blog_trigram[1:20, ], aes(x = name, y = freq.trigram)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Trigram Frequency") +
  theme(axis.text.x = element_text(angle = 45))
# drop the sentence marker '.' before computing coverage
twitter_instance <- twitter_unigram[-1, ]$freq.unigram
# function to count how many unique words are needed to cover a given share of all word instances
unique_cover <- function(instance, percent) {
  instance_count <- 0
  total <- sum(instance)
  for (index in seq_along(instance)) {
    instance_count <- instance_count + instance[index]
    if (instance_count / total >= percent) {
      print(index)
      break
    }
  }
}
unique_cover(twitter_instance, 0.5)
## [1] 128
unique_cover(twitter_instance, 0.9)
## [1] 5371
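For reference, the same coverage counts can be computed without an explicit loop by using a cumulative sum. A minimal sketch of an equivalent vectorized helper (unique_cover_vec is new here, added only for illustration):
# vectorized equivalent: first index at which cumulative coverage reaches the target
unique_cover_vec <- function(instance, percent) {
  which(cumsum(instance) / sum(instance) >= percent)[1]
}
unique_cover_vec(twitter_instance, 0.5)   # should agree with unique_cover above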
library("textcat")
## Warning: package 'textcat' was built under R version 3.2.3
con <- file(twitter_path, 'r')
twitter_text <- readLines(con)
close(con)
twitter_lan <- lapply(twitter_text, textcat)
table((twitter_lan == 'english'))
##
## FALSE TRUE
## 4040 5957
cbind(twitter_text[1:10], twitter_lan[1:10])
## [,1]
## [1,] "うん!面白かった!Very good English! I loved the Metal Gear ringtone."
## [2,] "Remember when the this #NCAATournament was all chalk and (yawn) boring back on Thursday?"
## [3,] "hey beautiful, what u up 2? looking great. keep up the good work!"
## [4,] "omg(: so excited!! Hope you do great(:"
## [5,] "hey hun.. I didnt gt it.. Its been a min u been m.i.a lol"
## [6,] "Karma is kickin me all up :/"
## [7,] "OMG! Can't wait for the release of Photoshop CS6 Extended...the 3D effects are pretty sweet."
## [8,] "had the best time at the tailgate..didn't even make it to the game! FAIL !! (Y)"
## [9,] "Oh, I'm liking the thirties so far. Just should have gone to bed sooner!"
## [10,] "Hey check us out man :)"
## [,2]
## [1,] "scots"
## [2,] "scots"
## [3,] "scots"
## [4,] "middle_frisian"
## [5,] "middle_frisian"
## [6,] "middle_frisian"
## [7,] "english"
## [8,] "scots"
## [9,] "english"
## [10,] "breton"
I use the ‘textcat’ package to detect foreign languages. The results are not good: many English sentences are apparently classified as non-English. Unicode ranges and a language dictionary could be combined to detect languages more reliably: Unicode ranges would handle non-Latin scripts, and a dictionary would handle languages that share the Latin alphabet.
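As a rough illustration of the Unicode part of that idea, a minimal sketch (non_latin_share is a hypothetical helper and the 30% threshold is an arbitrary choice): flag lines whose characters fall mostly outside the basic Latin range before applying a dictionary check to the rest.
# hypothetical helper: share of characters outside the basic Latin (ASCII) range
non_latin_share <- function(line) {
  codes <- utf8ToInt(line)
  if (length(codes) == 0) return(0)
  mean(codes > 127)
}
# flag lines where more than 30% of the characters are non-Latin (arbitrary threshold)
flagged <- sapply(twitter_text, non_latin_share) > 0.3
sum(flagged)
This is only a crude proxy (accented Latin characters count as non-Latin here); lines that pass would still go through the dictionary check.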
– Build a synonym dictionary to reduce the number of unique words, so that fewer words represent more instances (see the sketch below).
– Use n-gram prediction to cover additional words that are not among the most frequent unique words.
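A minimal sketch of what such a synonym/variant mapping might look like (the lookup table below is a made-up example, not derived from the data), applied before the n-gram counts:
# hypothetical lookup table mapping variants to a canonical form
synonyms <- c("u" = "you", "ur" = "your", "thx" = "thanks", "pls" = "please")
normalize_words <- function(words) {
  mapped <- unname(synonyms[words])
  # keep the original word when no canonical form is listed
  ifelse(is.na(mapped), words, mapped)
}
normalize_words(c("u", "gonna", "thx", "alot"))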
The unigram, bigram, and trigram frequencies are already stored as data frames in the previous work. Based on those, I can calculate the conditional probability of a word by dividing the count of the full n-gram by the count of its conditioning (n-1)-gram. Following the Markov chain assumption, I can combine these conditional probabilities, and the candidate word with the highest probability becomes the final prediction.
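A minimal sketch of that idea, assuming the trigram and bigram data frames built above (with columns name and freq.trigram / freq.bigram); predict_from_trigram is a hypothetical helper added here only for illustration:
# hypothetical helper: P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
predict_from_trigram <- function(w1, w2, trigram_df, bigram_df) {
  prefix <- paste(w1, w2)
  # trigrams that start with the given two words
  candidates <- trigram_df[grepl(paste0("^", prefix, " "), trigram_df$name), ]
  if (nrow(candidates) == 0) return(NA_character_)
  prefix_count <- bigram_df$freq.bigram[bigram_df$name == prefix]
  if (length(prefix_count) == 0) return(NA_character_)
  candidates$prob <- candidates$freq.trigram / prefix_count
  # last word of the highest-probability trigram is the prediction
  best <- candidates$name[which.max(candidates$prob)]
  tail(strsplit(best, " ")[[1]], 1)
}
predict_from_trigram("thanks", "for", twitter_trigram, twitter_bigram)
Since every candidate shares the same denominator, ranking by conditional probability is equivalent to ranking by raw trigram count; the explicit probabilities only start to matter once smoothing and backoff weights enter the model.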
To make the model smaller and more efficient, it is necessary to reduce the number of unique word pairs. One way is to use synonyms; the other is to annotate the words with part-of-speech (POS) tags, so that many word pairs collapse into a smaller number of tag pairs.
I start with n = 3 (trigrams). If the performance is not good, I will probably increase n.
Add-one (Laplace) smoothing is the simplest approach: add one to the count of every word, including unknown words, so that no n-gram receives zero probability.
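A minimal sketch of add-one smoothing for the bigram case, assuming the bigram and unigram data frames built above; V is the vocabulary size and laplace_bigram_prob is a hypothetical helper added for illustration:
# hypothetical helper: P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)
laplace_bigram_prob <- function(w1, w2, bigram_df, unigram_df) {
  V <- nrow(unigram_df)                                   # vocabulary size
  pair_count <- bigram_df$freq.bigram[bigram_df$name == paste(w1, w2)]
  if (length(pair_count) == 0) pair_count <- 0            # unseen bigram
  w1_count <- unigram_df$freq.unigram[unigram_df$name == w1]
  if (length(w1_count) == 0) w1_count <- 0                # unseen history word
  (pair_count + 1) / (w1_count + V)
}
laplace_bigram_prob("thanks", "for", twitter_bigram, twitter_unigram)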
Cross-validation: split the dataset into training and validation sets, and use the validation set to evaluate model performance.
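A minimal sketch of a simple hold-out split on the sampled twitter lines (the 80/20 ratio and the seed value are arbitrary choices for illustration):
set.seed(1234)                                            # arbitrary seed for reproducibility
n <- length(sample_twitter)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))    # 80% training
train_set <- sample_twitter[train_idx]
valid_set <- sample_twitter[-train_idx]                   # remaining 20% for validation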
If the n-gram model produces no prediction, we back off to the (n-1)-gram model, and so on recursively. Here, if the trigram prediction produces nothing, the bigram model takes over, and then the unigram model.
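A minimal sketch of that backoff chain, reusing the hypothetical predict_from_trigram helper sketched above and falling back to raw bigram and unigram counts:
# hypothetical backoff: trigram -> bigram -> most frequent unigram
predict_next <- function(w1, w2, trigram_df, bigram_df, unigram_df) {
  # 1. try the trigram model
  pred <- predict_from_trigram(w1, w2, trigram_df, bigram_df)
  if (!is.na(pred)) return(pred)
  # 2. back off to bigrams starting with the last word
  candidates <- bigram_df[grepl(paste0("^", w2, " "), bigram_df$name), ]
  if (nrow(candidates) > 0) {
    best <- candidates$name[which.max(candidates$freq.bigram)]
    return(tail(strsplit(best, " ")[[1]], 1))
  }
  # 3. back off to the most frequent unigram (skipping the '.' marker)
  as.character(unigram_df$name[unigram_df$name != "."][1])
}
predict_next("cant", "wait", twitter_trigram, twitter_bigram, twitter_unigram)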
Since this project is about typing prediction, the Shiny app should take text input and output the predicted next word. A fancier version would be dynamic, like typing on an iPhone, with the prediction popping up instantly.
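A minimal sketch of such an app skeleton, assuming a predict_next function like the one sketched above is available (the UI labels are placeholders):
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    words <- strsplit(tolower(input$phrase), "\\s+")[[1]]
    if (length(words) < 2) return("")
    # feed the last two typed words into the backoff predictor
    predict_next(words[length(words) - 1], words[length(words)],
                 twitter_trigram, twitter_bigram, twitter_unigram)
  })
}

shinyApp(ui = ui, server = server)
Because renderText is reactive, the prediction updates automatically as the user keeps typing, which already approximates the instant, iPhone-style behaviour described above.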