This document is presented as part of a capstone project. This assessment required the student to apply data science techniques in the area of natural language processing by building a predictive text application. The application is to be capable of scanning a stream of text as it is typed by a user and suggest possibilities for the next word to be appended to the input stream. The application is to be demonstrated via the Shiny platform, which will allow users to type an input text stream and receive a text predictions within a web based environment.
And this report examines a training dataset that will be used to build a predictive text model. The findings serve as an initial exploratory analysis before taking initial steps to building an n-gram model for text prediction.
The plan on getting the data is to read all the data first from a corpus called HC Corpora (www.corpora.heliohost.org). For this step of exploratory analysis, we have selected a sample(5%) of the data from each documents.
After reading the documents, we prepare a corpus containing sample of all three files of US langauge(blogs, news and tweets). And work only on a sample of the data for the 3 documents (Blogs, News, Twitter).
library(NLP)
library(tm)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(readtext)
library(stringr)
library(RWeka)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
The data consist of text from 3 different sources: blogs, news, and twitter feeds and are provided in 4 different languages: German, English (US), Finnish, and Russian. For the remainder of this project, we will use only the the English (US) data sets. Load the US language corpus (blogs, news and tweets) and work only on a 5% sample for the 3 documents (Blogs, News, Twitter):
setwd("C:/Users/rraju/Desktop/Coursera/Capstone project/Coursera-SwiftKey/final/en_US")
data.raw <-VCorpus( DirSource(directory=".",
encoding = "UTF-8"),
readerControl = list(language="us"))
blogsDoc = data.raw[[1]]$content
newsDoc = data.raw[[2]]$content
tweetDoc = data.raw[[3]]$content
Examing the first few lines of each data set:
message("Blogs document")
## Blogs document
head(blogsDoc)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan gods."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [6] "If you have an alternative argument, let's hear it! :)"
message("There are ",length(blogsDoc), " lines in original en_US.blogs.txt")
## There are 899288 lines in original en_US.blogs.txt
message("News document")
## News document
head(newsDoc)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
message("There are ",length(newsDoc), " lines in original en_US.news.txt")
## There are 77259 lines in original en_US.news.txt
message("Twitter document")
## Twitter document
head(tweetDoc)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
message("There are ",length(tweetDoc), " lines in original en_US.twitter.txt")
## There are 2360148 lines in original en_US.twitter.txt
Some preliminary analysis is conducted on the raw datasets, based on the first Quiz Set.
file.info("C:/Users/rraju/Desktop/Coursera/Capstone project/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size/1024^2
## [1] 200.4242
length(tweetDoc)
## [1] 2360148
max(nchar(blogsDoc))
## [1] 40833
val_love <- sum(grepl(pattern = "love", x = tweetDoc))
val_hate <- sum(grepl(pattern = "hate", x = tweetDoc))
val_love / val_hate
## [1] 4.108592
tweetDoc[grep(pattern = "biostat", x = tweetDoc)]
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
sum(grepl(pattern = "A computer once beat me at chess, but it was no match for me at kickboxing", x = tweetDoc))
## [1] 3
Due to the scale of each dataset, a 5% sample for the 3 documents(Blogs, News, Twitter).
set.seed(1)
blogsDoc.smpl <- sample(data.raw[[1]]$content, length(data.raw[[1]]$content)*0.05) # Blogs
newsDoc.smpl <- sample(data.raw[[2]]$content, length(data.raw[[2]]$content)*0.05) # News
tweetDoc.smpl <- sample(data.raw[[3]]$content, length(data.raw[[3]]$content)*0.05) # Twitter
message("We keep only 5% (= ",length(blogsDoc.smpl), " lines) for en_US.blogs.txt")
## We keep only 5% (= 44964 lines) for en_US.blogs.txt
message("We keep only 5% (= ",length(newsDoc.smpl), " lines) for en_US.news.txt")
## We keep only 5% (= 3862 lines) for en_US.news.txt
message("We keep only 5% (= ",length(tweetDoc.smpl), " lines) for en_US.twitter.txt")
## We keep only 5% (= 118007 lines) for en_US.twitter.txt
smpl <- c(tweetDoc.smpl, newsDoc.smpl, blogsDoc.smpl)
smpl <- iconv(smpl, 'UTF-8', 'ASCII')
data.smpl <- Corpus(VectorSource(as.data.frame(smpl, stringsAsFactors = FALSE)))
Before all analysis procedures, all 3 documents need to be free of following things:
detectNonAsciiChar <- function(x) iconv(x, from="UTF-8", to="ASCII",sub="X")
data.smpl <- tm_map(data.smpl, content_transformer(detectNonAsciiChar))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(detectNonAsciiChar)): transformation drops documents
removeNonAsciiWord <- function(x) gsub("[a-z]*X+[a-z]*", " ", x)
data.smpl <- tm_map(data.smpl, content_transformer(removeNonAsciiWord))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(removeNonAsciiWord)): transformation drops documents
removeHTTPS <- function(x) gsub("https://(.*)[.][a-z]+|https://[a-z]+", " ", x)
removeHTTP <- function(x) gsub("http://(.*)[.][a-z]+|https://[a-z]+", " ", x)
removeFTP <- function(x) gsub("ftp://(.*)[.][a-z]+|https://[a-z]+", " ", x)
removeWWW <- function(x) gsub("www(.*)[.][a-z]+|www.", " ", x)
removeHashTag <- function(x) gsub("#[a-z0-9]+", " ", x)
removeTwitterRT <- function(x) gsub("^rt |^rt:", " ", x)
data.smpl <- tm_map(data.smpl, content_transformer(removeHTTPS))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(removeHTTPS)): transformation drops documents
data.smpl <- tm_map(data.smpl, content_transformer(removeHTTP))
## Warning in tm_map.SimpleCorpus(data.smpl, content_transformer(removeHTTP)):
## transformation drops documents
data.smpl <- tm_map(data.smpl, content_transformer(removeFTP))
## Warning in tm_map.SimpleCorpus(data.smpl, content_transformer(removeFTP)):
## transformation drops documents
data.smpl <- tm_map(data.smpl, content_transformer(removeWWW))
## Warning in tm_map.SimpleCorpus(data.smpl, content_transformer(removeWWW)):
## transformation drops documents
data.smpl <- tm_map(data.smpl, content_transformer(removeHashTag))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(removeHashTag)): transformation drops documents
data.smpl <- tm_map(data.smpl, content_transformer(removeTwitterRT))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(removeTwitterRT)): transformation drops documents
data.smpl <- tm_map(data.smpl, removePunctuation)
## Warning in tm_map.SimpleCorpus(data.smpl, removePunctuation):
## transformation drops documents
data.smpl <- tm_map(data.smpl, removeNumbers)
## Warning in tm_map.SimpleCorpus(data.smpl, removeNumbers): transformation
## drops documents
data.smpl <- tm_map(data.smpl, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(data.smpl, removeWords,
## stopwords("english")): transformation drops documents
profanity_words <- readtext("C:\\Users\\rraju\\Desktop\\Coursera\\Capstone project\\Coursera-SwiftKey\\final\\Smear Words.docx")
data.smpl <- tm_map(data.smpl , removeWords, profanity_words)
## Warning in tm_map.SimpleCorpus(data.smpl, removeWords, profanity_words):
## transformation drops documents
data.smpl <- tm_map(data.smpl, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(data.smpl, content_transformer(tolower)):
## transformation drops documents
data.smpl <- tm_map(data.smpl, stripWhitespace)
## Warning in tm_map.SimpleCorpus(data.smpl, stripWhitespace): transformation
## drops documents
removeCharRepetition <- function(x) {
a <- gsub("[a-z]*aaa[a-z]*", " ", x)
a <- gsub("[a-z]*bbb[a-z]*", " ", a)
a <- gsub("[a-z]*ccc[a-z]*", " ", a)
a <- gsub("[a-z]*ddd[a-z]*", " ", a)
a <- gsub("[a-z]*eee[a-z]*", " ", a)
a <- gsub("[a-z]*fff[a-z]*", " ", a)
a <- gsub("[a-z]*ggg[a-z]*", " ", a)
a <- gsub("[a-z]*hhh[a-z]*", " ", a)
a <- gsub("[a-z]*iii[a-z]*", " ", a)
a <- gsub("[a-z]*jjj[a-z]*", " ", a)
a <- gsub("[a-z]*kkk[a-z]*", " ", a)
a <- gsub("[a-z]*lll[a-z]*", " ", a)
a <- gsub("[a-z]*mmm[a-z]*", " ", a)
a <- gsub("[a-z]*nnn[a-z]*", " ", a)
a <- gsub("[a-z]*ooo[a-z]*", " ", a)
a <- gsub("[a-z]*ppp[a-z]*", " ", a)
a <- gsub("[a-z]*qqq[a-z]*", " ", a)
a <- gsub("[a-z]*rrr[a-z]*", " ", a)
a <- gsub("[a-z]*sss[a-z]*", " ", a)
a <- gsub("[a-z]*ttt[a-z]*", " ", a)
a <- gsub("[a-z]*uuu[a-z]*", " ", a)
a <- gsub("[a-z]*vvv[a-z]*", " ", a)
a <- gsub("[a-z]*www[a-z]*", " ", a)
a <- gsub("[a-z]*xxx[a-z]*", " ", a)
a <- gsub("[a-z]*yyy[a-z]*", " ", a)
a <- gsub("[a-z]*zzz[a-z]*", " ", a)
}
data.smpl <- tm_map(data.smpl, content_transformer(removeCharRepetition))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(removeCharRepetition)): transformation drops documents
memory_files <- data.frame(files=c("blogsDoc","tweetDoc","newsDoc"),
memory=c(object.size(blogsDoc),object.size(tweetDoc),object.size(newsDoc)))
memory_files
## files memory
## 1 blogsDoc 249772856
## 2 tweetDoc 288149360
## 3 newsDoc 19184280
lines_files <- data.frame(files=c("blogsDoc","tweetDoc","newsDoc"),
lines=c(length(blogsDoc),length(tweetDoc),length(newsDoc)))
lines_files
## files lines
## 1 blogsDoc 899288
## 2 tweetDoc 2360148
## 3 newsDoc 77259
word_stats <- data.frame(files=c(rep("blogsDoc",length(blogsDoc)),
rep("tweetDoc",length(tweetDoc)),
rep("newsDoc",length(newsDoc))),
wordcount=c(str_count(blogsDoc,"\\w+"),
str_count(tweetDoc,"\\w+"),
str_count(newsDoc,"\\w+")))
message("Summary of words in file blogsDoc :")
## Summary of words in file blogsDoc :
print(summary(word_stats[word_stats$file=="blogsDoc","wordcount"]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 9.0 29.0 42.6 61.0 6851.0
message("Summary of words in file tweetDoc :")
## Summary of words in file tweetDoc :
print(summary(word_stats[word_stats$file=="tweetDoc","wordcount"]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 13.14 19.00 47.00
message("Summary of words in file newsDoc :")
## Summary of words in file newsDoc :
print(summary(word_stats[word_stats$file=="newsDoc","wordcount"]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 35.49 47.00 1522.00
message("Total words :")
## Total words :
print(sum(word_stats$wordcount))
## [1] 72054715
words.all <- sort(table(c(blogsDoc.smpl, newsDoc.smpl, tweetDoc.smpl)),decreasing=TRUE) # all word freq
# Head of words in all files
head(words.all)
##
## Thank you! Thanks for the RT! Thanks for the follow!
## 32 30 12
## thank you thank you :) thank you!
## 11 10 9
# Tail of words in all files
tail(words.all)
##
## Zuckerberg as man of the year? What has that stupid geek done that is so great?
## 1
## Zumba-a-thon with 2 of my most favorite ladies <U+0001F46F><U+0001F646><U+0001F483>
## 1
## Zumba Fitness is an exhilarating, effective, easy-to-follow, Latin-inspired, calorie-burning dance fitness-party tonight at 6pm Titans...
## 1
## Zumwalt West's next outing is at 10 a.m. Saturday against Parkway South (2-2) in the opening round of the Visitation Tournament.
## 1
## Zuniga blames the increase in deaths to the harsh winter. He works to get the word out that criminal organizations prey on illegal aliens and treat them as cargo. He says illegal aliens pay coyotes as much as $3,000 to go by land, $9,000 by sea. Often, they are ripped off or left to die.
## 1
## Zusi playing on the right wing tonite?
## 1
# Quantiles of words in all files
q.all <- quantile(words.all,probs=c(0,25,50,75,80,95,99,100)/100,type=3)
print(q.all)
## 0% 25% 50% 75% 80% 95% 99% 100%
## 1 1 1 1 1 1 1 32
# Q-Q plot of words in all files
qqnorm(words.all,main="Normal Q-Q plot of words in all files")
#qqline(words.all)
# Histogram of words in all files
hist(words.all)
As we can see the words frequencies are totally skewed to the right, using only 1% of the words we can solve 91% of the text. This reduced number of words could be the difference of success or failure.
term.doc.matrix <- TermDocumentMatrix(data.smpl)
term.doc.matrix <- as.matrix(term.doc.matrix)
word.freqs <- sort(rowSums(term.doc.matrix), decreasing=TRUE)
dm <- data.frame(word=names(word.freqs), freq=word.freqs)
wordcloud(dm$word, dm$freq, min.freq= 500, random.order=TRUE, rot.per=.25, colors=brewer.pal(8, "Pastel1"))
We find that the data contains many non english characters. We have to identify such tokens and remove them because we dont want to predict them. For this purpose we will use them “tm” package which is extensively used for text mining.
unigram <- NGramTokenizer(data.smpl, Weka_control(min = 1, max = 1))
bigram <- NGramTokenizer(data.smpl, Weka_control(min = 2, max = 2))
trigram <- NGramTokenizer(data.smpl, Weka_control(min = 3, max = 3))
UniGram.df <- data.frame(table(unigram))
BiGram.df <- data.frame(table(bigram))
TriGram.df <- data.frame(table(trigram))
UniGram.df<- UniGram.df[order(UniGram.df$Freq,decreasing = TRUE),]
BiGram.df <- BiGram.df [order(BiGram.df $Freq,decreasing = TRUE),]
TriGram.df <- TriGram.df[order(TriGram.df$Freq,decreasing = TRUE),]
top20.UniGram <- UniGram.df[1:20,]
top20.UniGram
## unigram Freq
## 692 i 97
## 932 na 31
## 832 love 21
## 1420 the 20
## 760 just 19
## 994 one 18
## 601 got 16
## 1556 we 15
## 57 also 13
## 600 good 13
## 736 it 13
## 405 dont 12
## 577 get 12
## 703 im 12
## 1144 really 12
## 1637 you 12
## 213 can 11
## 519 first 11
## 807 like 11
## 819 little 10
top20.BiGram <- BiGram.df [1:20,]
top20.BiGram
## bigram Freq
## 1082 i love 8
## 1566 na na 8
## 1071 i got 6
## 1078 i just 4
## 1095 i really 4
## 1054 i bought 3
## 1090 i need 3
## 2315 the first 3
## 77 also got 2
## 186 bb liner 2
## 213 best pg 2
## 371 cc yes 2
## 465 corn flakes 2
## 527 day weekend 2
## 745 felt like 2
## 834 general synod 2
## 901 good time 2
## 935 guess i 2
## 1050 i also 2
## 1052 i believe 2
top20.TriGram <- TriGram.df[1:20, ]
top20.TriGram
## trigram Freq
## 1870 please reply dm 2
## 2378 the bb liner 2
## 1 a bottle fucking 1
## 2 a great website 1
## 3 a ice cream 1
## 4 aam online fever 1
## 5 abby behind father 1
## 6 abby here reading 1
## 7 able take great 1
## 8 absolutely beautiful we 1
## 9 absolutely love packers 1
## 10 absolutly stay tuned 1
## 11 abuse women left 1
## 12 academy award nominee 1
## 13 acceptance amongst black 1
## 14 according watson wyatt 1
## 15 account fan someone 1
## 16 accredited gambler lose 1
## 17 accusing babying sons 1
## 18 across rink hair 1
ggplot(top20.UniGram, aes(x=reorder(unigram, Freq), y=Freq)) +
geom_bar(stat="Identity",fill="black") + coord_flip() +
xlab("Unigrams") + ylab("Frequency")+
ggtitle("Top 20 unigrams by frequency") +
theme(axis.text.x=element_text(angle=90, hjust=1))
ggplot(top20.BiGram, aes(x=reorder(bigram, Freq), y=Freq)) +
geom_bar(stat="Identity",fill="black") + coord_flip() +
xlab("Bigrams") + ylab("Frequency")+
ggtitle("Top 20 bigrams by frequency") +
theme(axis.text.x=element_text(angle=90, hjust=1))
ggplot(top20.TriGram, aes(x=reorder(trigram, Freq), y=Freq)) +
geom_bar(stat="Identity",fill="black") + coord_flip() +
xlab("Trigrams") + ylab("Frequency")+
ggtitle("Top 20 trigrams by frequency") +
theme(axis.text.x=element_text(angle=90, hjust=1))
My next steps will be: