1. Introduction

This document is presented as part of a capstone project. This assessment required the student to apply data science techniques in the area of natural language processing by building a predictive text application. The application is to be capable of scanning a stream of text as it is typed by a user and suggest possibilities for the next word to be appended to the input stream. The application is to be demonstrated via the Shiny platform, which will allow users to type an input text stream and receive a text predictions within a web based environment.

And this report examines a training dataset that will be used to build a predictive text model. The findings serve as an initial exploratory analysis before taking initial steps to building an n-gram model for text prediction.

2. Getting the Data

The plan on getting the data is to read all the data first from a corpus called HC Corpora (www.corpora.heliohost.org). For this step of exploratory analysis, we have selected a sample(5%) of the data from each documents.

After reading the documents, we prepare a corpus containing sample of all three files of US langauge(blogs, news and tweets). And work only on a sample of the data for the 3 documents (Blogs, News, Twitter).

3. Loading packages and data

R Packages:

library(NLP)
library(tm)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(readtext)
library(stringr)
library(RWeka)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)

Data

The data consist of text from 3 different sources: blogs, news, and twitter feeds and are provided in 4 different languages: German, English (US), Finnish, and Russian. For the remainder of this project, we will use only the the English (US) data sets. Load the US language corpus (blogs, news and tweets) and work only on a 5% sample for the 3 documents (Blogs, News, Twitter):

setwd("C:/Users/rraju/Desktop/Coursera/Capstone project/Coursera-SwiftKey/final/en_US")
data.raw  <-VCorpus( DirSource(directory=".",
                           encoding = "UTF-8"),
                 readerControl = list(language="us"))

blogsDoc = data.raw[[1]]$content
newsDoc = data.raw[[2]]$content
tweetDoc = data.raw[[3]]$content

Examing the first few lines of each data set:

message("Blogs document")
## Blogs document
head(blogsDoc)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"                                                                                    
## [6] "If you have an alternative argument, let's hear it! :)"
message("There are ",length(blogsDoc), " lines in original en_US.blogs.txt")
## There are 899288 lines in original en_US.blogs.txt
message("News document")
## News document
head(newsDoc)
## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"                                                                                                                                                                                                                                                            
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
message("There are ",length(newsDoc), " lines in original en_US.news.txt")
## There are 77259 lines in original en_US.news.txt
message("Twitter document")
## Twitter document
head(tweetDoc)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
message("There are ",length(tweetDoc), " lines in original en_US.twitter.txt")
## There are 2360148 lines in original en_US.twitter.txt

4. Preliminary Analysis

Some preliminary analysis is conducted on the raw datasets, based on the first Quiz Set.

  1. The en_US.blogs.txt file is how many megabytes?
file.info("C:/Users/rraju/Desktop/Coursera/Capstone project/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size/1024^2
## [1] 200.4242
  1. The en_US.twitter.txt has how many lines of text?
length(tweetDoc)
## [1] 2360148
  1. What is the length of the longest line seen in any of the three en_US data sets?
max(nchar(blogsDoc))
## [1] 40833
  1. In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines the word “hate” (all lowercase) occurs, about what do you get?
val_love <- sum(grepl(pattern = "love", x = tweetDoc))
val_hate <- sum(grepl(pattern = "hate", x = tweetDoc))
val_love / val_hate
## [1] 4.108592
  1. The one tweet in the en_US twitter data set that matches the word “biostats” says what?
tweetDoc[grep(pattern = "biostat", x = tweetDoc)]
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
  1. How many tweets have the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing”. (I.e. the line matches those characters exactly.)
sum(grepl(pattern = "A computer once beat me at chess, but it was no match for me at kickboxing", x = tweetDoc))
## [1] 3

5. Pre-processing the Data

Preparing a sample from the required dataset.

Due to the scale of each dataset, a 5% sample for the 3 documents(Blogs, News, Twitter).

set.seed(1)
blogsDoc.smpl <- sample(data.raw[[1]]$content, length(data.raw[[1]]$content)*0.05) # Blogs
newsDoc.smpl <- sample(data.raw[[2]]$content, length(data.raw[[2]]$content)*0.05) # News
tweetDoc.smpl <- sample(data.raw[[3]]$content, length(data.raw[[3]]$content)*0.05) # Twitter

message("We keep only 5% (= ",length(blogsDoc.smpl), " lines) for en_US.blogs.txt")
## We keep only 5% (= 44964 lines) for en_US.blogs.txt
message("We keep only 5% (= ",length(newsDoc.smpl), " lines) for en_US.news.txt")
## We keep only 5% (= 3862 lines) for en_US.news.txt
message("We keep only 5% (= ",length(tweetDoc.smpl), " lines) for en_US.twitter.txt")
## We keep only 5% (= 118007 lines) for en_US.twitter.txt
smpl <- c(tweetDoc.smpl, newsDoc.smpl, blogsDoc.smpl)
smpl <- iconv(smpl, 'UTF-8', 'ASCII')
data.smpl <- Corpus(VectorSource(as.data.frame(smpl, stringsAsFactors = FALSE)))

Data cleaning

Before all analysis procedures, all 3 documents need to be free of following things:

  1. Convert from UTF-8 to ASCII.
  2. Remove non ASCII words.
  3. Remove http addressess.
  4. Remove punctuation.
  5. Remove numbers.
  6. Remove english stopwords.
  7. Remove profanity.
  8. Convert text to lowercase.
  9. Remove all whitespace
  10. Remove words with atleast the same 3 consecutive characters(ex: yeeeeees)

1. Convert from UTF-8 to ASCII.

detectNonAsciiChar <- function(x) iconv(x, from="UTF-8", to="ASCII",sub="X")
data.smpl <- tm_map(data.smpl, content_transformer(detectNonAsciiChar))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(detectNonAsciiChar)): transformation drops documents

2. Remove non ASCII words.

removeNonAsciiWord <- function(x) gsub("[a-z]*X+[a-z]*", " ", x)
data.smpl <- tm_map(data.smpl, content_transformer(removeNonAsciiWord))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(removeNonAsciiWord)): transformation drops documents

3. Remove http addressess.

removeHTTPS <- function(x) gsub("https://(.*)[.][a-z]+|https://[a-z]+", " ", x)
removeHTTP  <- function(x) gsub("http://(.*)[.][a-z]+|https://[a-z]+", " ", x)
removeFTP <- function(x) gsub("ftp://(.*)[.][a-z]+|https://[a-z]+", " ", x)
removeWWW <- function(x) gsub("www(.*)[.][a-z]+|www.", " ", x)
removeHashTag <- function(x) gsub("#[a-z0-9]+", " ", x)
removeTwitterRT <- function(x) gsub("^rt |^rt:", " ", x)

data.smpl <- tm_map(data.smpl, content_transformer(removeHTTPS))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(removeHTTPS)): transformation drops documents
data.smpl <- tm_map(data.smpl, content_transformer(removeHTTP))
## Warning in tm_map.SimpleCorpus(data.smpl, content_transformer(removeHTTP)):
## transformation drops documents
data.smpl <- tm_map(data.smpl, content_transformer(removeFTP))
## Warning in tm_map.SimpleCorpus(data.smpl, content_transformer(removeFTP)):
## transformation drops documents
data.smpl <- tm_map(data.smpl, content_transformer(removeWWW))
## Warning in tm_map.SimpleCorpus(data.smpl, content_transformer(removeWWW)):
## transformation drops documents
data.smpl <- tm_map(data.smpl, content_transformer(removeHashTag))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(removeHashTag)): transformation drops documents
data.smpl <- tm_map(data.smpl, content_transformer(removeTwitterRT))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(removeTwitterRT)): transformation drops documents

4. Remove punctuation.

data.smpl <- tm_map(data.smpl, removePunctuation)
## Warning in tm_map.SimpleCorpus(data.smpl, removePunctuation):
## transformation drops documents

5. Remove numbers.

data.smpl <- tm_map(data.smpl, removeNumbers)
## Warning in tm_map.SimpleCorpus(data.smpl, removeNumbers): transformation
## drops documents

6. Remove english stopwords.

data.smpl <- tm_map(data.smpl, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(data.smpl, removeWords,
## stopwords("english")): transformation drops documents

7. Remove profanity words.

profanity_words <- readtext("C:\\Users\\rraju\\Desktop\\Coursera\\Capstone project\\Coursera-SwiftKey\\final\\Smear Words.docx")
data.smpl <- tm_map(data.smpl , removeWords, profanity_words)
## Warning in tm_map.SimpleCorpus(data.smpl, removeWords, profanity_words):
## transformation drops documents

8. Convert text to lowercase.

data.smpl <- tm_map(data.smpl, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(data.smpl, content_transformer(tolower)):
## transformation drops documents

9. Remove all whitespace

data.smpl <- tm_map(data.smpl, stripWhitespace)
## Warning in tm_map.SimpleCorpus(data.smpl, stripWhitespace): transformation
## drops documents

10. Remove words with atleast the same 3 consecutive characters

removeCharRepetition <- function(x) {
  a <- gsub("[a-z]*aaa[a-z]*", " ", x)
  a <- gsub("[a-z]*bbb[a-z]*", " ", a)
  a <- gsub("[a-z]*ccc[a-z]*", " ", a)
  a <- gsub("[a-z]*ddd[a-z]*", " ", a)
  a <- gsub("[a-z]*eee[a-z]*", " ", a)
  a <- gsub("[a-z]*fff[a-z]*", " ", a)
  a <- gsub("[a-z]*ggg[a-z]*", " ", a)
  a <- gsub("[a-z]*hhh[a-z]*", " ", a)
  a <- gsub("[a-z]*iii[a-z]*", " ", a)
  a <- gsub("[a-z]*jjj[a-z]*", " ", a)
  a <- gsub("[a-z]*kkk[a-z]*", " ", a)
  a <- gsub("[a-z]*lll[a-z]*", " ", a)
  a <- gsub("[a-z]*mmm[a-z]*", " ", a)
  a <- gsub("[a-z]*nnn[a-z]*", " ", a)
  a <- gsub("[a-z]*ooo[a-z]*", " ", a)
  a <- gsub("[a-z]*ppp[a-z]*", " ", a)
  a <- gsub("[a-z]*qqq[a-z]*", " ", a)
  a <- gsub("[a-z]*rrr[a-z]*", " ", a)
  a <- gsub("[a-z]*sss[a-z]*", " ", a)
  a <- gsub("[a-z]*ttt[a-z]*", " ", a)
  a <- gsub("[a-z]*uuu[a-z]*", " ", a)
  a <- gsub("[a-z]*vvv[a-z]*", " ", a)
  a <- gsub("[a-z]*www[a-z]*", " ", a)
  a <- gsub("[a-z]*xxx[a-z]*", " ", a)
  a <- gsub("[a-z]*yyy[a-z]*", " ", a)
  a <- gsub("[a-z]*zzz[a-z]*", " ", a)
}

data.smpl <- tm_map(data.smpl, content_transformer(removeCharRepetition))
## Warning in tm_map.SimpleCorpus(data.smpl,
## content_transformer(removeCharRepetition)): transformation drops documents

6. Analyzing the data

Population Data

Data size(in bytes) in memory.

memory_files <- data.frame(files=c("blogsDoc","tweetDoc","newsDoc"),
                         memory=c(object.size(blogsDoc),object.size(tweetDoc),object.size(newsDoc)))
memory_files
##      files    memory
## 1 blogsDoc 249772856
## 2 tweetDoc 288149360
## 3  newsDoc  19184280

Number of lines in each text

lines_files <- data.frame(files=c("blogsDoc","tweetDoc","newsDoc"),
                          lines=c(length(blogsDoc),length(tweetDoc),length(newsDoc)))
lines_files
##      files   lines
## 1 blogsDoc  899288
## 2 tweetDoc 2360148
## 3  newsDoc   77259

Summary information of number of words by line in each text file

word_stats <- data.frame(files=c(rep("blogsDoc",length(blogsDoc)),
                                 rep("tweetDoc",length(tweetDoc)),
                                 rep("newsDoc",length(newsDoc))),
                         wordcount=c(str_count(blogsDoc,"\\w+"),
                                   str_count(tweetDoc,"\\w+"),
                                   str_count(newsDoc,"\\w+")))


message("Summary of words in file blogsDoc :")
## Summary of words in file blogsDoc :
print(summary(word_stats[word_stats$file=="blogsDoc","wordcount"]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     9.0    29.0    42.6    61.0  6851.0
message("Summary of words in file tweetDoc :")
## Summary of words in file tweetDoc :
print(summary(word_stats[word_stats$file=="tweetDoc","wordcount"]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   13.14   19.00   47.00
message("Summary of words in file newsDoc :")
## Summary of words in file newsDoc :
print(summary(word_stats[word_stats$file=="newsDoc","wordcount"]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   35.49   47.00 1522.00
message("Total words :")
## Total words :
print(sum(word_stats$wordcount))
## [1] 72054715

Statistics on all words in 3 documents.

words.all <- sort(table(c(blogsDoc.smpl, newsDoc.smpl, tweetDoc.smpl)),decreasing=TRUE) # all word freq

# Head of words in all files
head(words.all)
## 
##             Thank you!     Thanks for the RT! Thanks for the follow! 
##                     32                     30                     12 
##              thank you           thank you :)             thank you! 
##                     11                     10                      9
# Tail of words in all files
tail(words.all)
## 
##                                                                                                                                                                                                                  Zuckerberg as man of the year? What has that stupid geek done that is so great? 
##                                                                                                                                                                                                                                                                                                1 
##                                                                                                                                                                                                              Zumba-a-thon with 2 of my most favorite ladies <U+0001F46F><U+0001F646><U+0001F483> 
##                                                                                                                                                                                                                                                                                                1 
##                                                                                                                                                        Zumba Fitness is an exhilarating, effective, easy-to-follow, Latin-inspired, calorie-burning dance fitness-party tonight at 6pm Titans... 
##                                                                                                                                                                                                                                                                                                1 
##                                                                                                                                                                 Zumwalt West's next outing is at 10 a.m. Saturday against Parkway South (2-2) in the opening round of the Visitation Tournament. 
##                                                                                                                                                                                                                                                                                                1 
## Zuniga blames the increase in deaths to the harsh winter. He works to get the word out that criminal organizations prey on illegal aliens and treat them as cargo. He says illegal aliens pay coyotes as much as $3,000 to go by land, $9,000 by sea. Often, they are ripped off or left to die. 
##                                                                                                                                                                                                                                                                                                1 
##                                                                                                                                                                                                                                                           Zusi playing on the right wing tonite? 
##                                                                                                                                                                                                                                                                                                1
# Quantiles of words in all files
q.all <- quantile(words.all,probs=c(0,25,50,75,80,95,99,100)/100,type=3)
print(q.all)
##   0%  25%  50%  75%  80%  95%  99% 100% 
##    1    1    1    1    1    1    1   32
# Q-Q plot of words in all files
qqnorm(words.all,main="Normal Q-Q plot of words in all files")

#qqline(words.all)

# Histogram of words in all files
hist(words.all)

As we can see the words frequencies are totally skewed to the right, using only 1% of the words we can solve 91% of the text. This reduced number of words could be the difference of success or failure.

7. Word cloud plot of the most common words in the corpus

term.doc.matrix <- TermDocumentMatrix(data.smpl)
term.doc.matrix <- as.matrix(term.doc.matrix)
word.freqs <- sort(rowSums(term.doc.matrix), decreasing=TRUE) 
dm <- data.frame(word=names(word.freqs), freq=word.freqs)

wordcloud(dm$word, dm$freq, min.freq= 500, random.order=TRUE, rot.per=.25, colors=brewer.pal(8, "Pastel1"))

8. Tokenize the sentences

We find that the data contains many non english characters. We have to identify such tokens and remove them because we dont want to predict them. For this purpose we will use them “tm” package which is extensively used for text mining.

Convert the sample corpus to a data frame to plug into RWeka Tokenizer

unigram <- NGramTokenizer(data.smpl, Weka_control(min = 1, max = 1))
bigram <- NGramTokenizer(data.smpl, Weka_control(min = 2, max = 2)) 
trigram <- NGramTokenizer(data.smpl, Weka_control(min = 3, max = 3)) 

Converting corpus to data frame

UniGram.df <- data.frame(table(unigram))
BiGram.df  <- data.frame(table(bigram))
TriGram.df <- data.frame(table(trigram))

Sorting

UniGram.df<- UniGram.df[order(UniGram.df$Freq,decreasing = TRUE),]
BiGram.df   <- BiGram.df  [order(BiGram.df  $Freq,decreasing = TRUE),]
TriGram.df <- TriGram.df[order(TriGram.df$Freq,decreasing = TRUE),]

Top 20 Uni, Bi and TriGrams

top20.UniGram  <- UniGram.df[1:20,]
top20.UniGram
##      unigram Freq
## 692        i   97
## 932       na   31
## 832     love   21
## 1420     the   20
## 760     just   19
## 994      one   18
## 601      got   16
## 1556      we   15
## 57      also   13
## 600     good   13
## 736       it   13
## 405     dont   12
## 577      get   12
## 703       im   12
## 1144  really   12
## 1637     you   12
## 213      can   11
## 519    first   11
## 807     like   11
## 819   little   10
top20.BiGram <- BiGram.df [1:20,]
top20.BiGram
##             bigram Freq
## 1082        i love    8
## 1566         na na    8
## 1071         i got    6
## 1078        i just    4
## 1095      i really    4
## 1054      i bought    3
## 1090        i need    3
## 2315     the first    3
## 77        also got    2
## 186       bb liner    2
## 213        best pg    2
## 371         cc yes    2
## 465    corn flakes    2
## 527    day weekend    2
## 745      felt like    2
## 834  general synod    2
## 901      good time    2
## 935        guess i    2
## 1050        i also    2
## 1052     i believe    2
top20.TriGram  <- TriGram.df[1:20, ]
top20.TriGram
##                       trigram Freq
## 1870          please reply dm    2
## 2378             the bb liner    2
## 1            a bottle fucking    1
## 2             a great website    1
## 3                 a ice cream    1
## 4            aam online fever    1
## 5          abby behind father    1
## 6           abby here reading    1
## 7             able take great    1
## 8     absolutely beautiful we    1
## 9     absolutely love packers    1
## 10       absolutly stay tuned    1
## 11           abuse women left    1
## 12      academy award nominee    1
## 13   acceptance amongst black    1
## 14     according watson wyatt    1
## 15        account fan someone    1
## 16    accredited gambler lose    1
## 17      accusing babying sons    1
## 18           across rink hair    1

9. Generate n-gram histograms

Unigram frequency

ggplot(top20.UniGram, aes(x=reorder(unigram, Freq), y=Freq)) +
  geom_bar(stat="Identity",fill="black") + coord_flip() +
  xlab("Unigrams") + ylab("Frequency")+
  ggtitle("Top 20 unigrams by frequency") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

Bigram frequency

ggplot(top20.BiGram, aes(x=reorder(bigram, Freq), y=Freq)) +
  geom_bar(stat="Identity",fill="black") + coord_flip() +
  xlab("Bigrams") + ylab("Frequency")+
  ggtitle("Top 20 bigrams by frequency") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

Trigram frequency

ggplot(top20.TriGram, aes(x=reorder(trigram, Freq), y=Freq)) +
  geom_bar(stat="Identity",fill="black") + coord_flip() +
  xlab("Trigrams") + ylab("Frequency")+
  ggtitle("Top 20 trigrams by frequency") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

10. Future Work

My next steps will be:

  1. Create the predictive model and test it .
  2. Create shiny application based on the predictive model.