Synopsis
In this project we build a predictive text model that presents three options for what the next word might be when someone types a sentence on a keyboard.
Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. The company SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is its predictive text model. When someone types a sentence, the keyboard presents three options for what the next word might be.
In this project I will analyze a large corpus of text documents to discover the structure in the data and how words are put together. I will clean and analyze the text data, then build and sample from a predictive text model. Finally, I will build a predictive text product. This RMarkdown report is written so that non data scientists can understand it as well.
The first step in analyzing any new data set is figuring out:
- what data you have
- what are the standard tools and models used for that type of data.
The data is downloaded from Coursera in files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data comes from a corpus called HC Corpora (www.corpora.heliohost.org). The files have been language filtered but may still contain some foreign text. Besides the downloaded files there are other data sources that might help with this project. The quality of the text prediction in this project is limited to the text I just downloaded. To get more high-quality text you could also think of using an English dictionary as an extra source: not only the single words would add quality, but also the example sentences in which each word is used.
Tasks to accomplish
- Obtaining the data - Can you download the data and load/manipulate it in R?
- Familiarizing yourself with NLP and text mining - Learn about the basics of natural language processing and how it relates to the data science process you have learned in the Data Science Specialization.
Familiarizing yourself with NLP and text mining
Natural language processing (NLP) is a set of techniques for using computers to detect in human language the kinds of things that humans detect automatically. For example, when you read a text, you parse it into paragraphs and sentences. You do not explicitly label the parts of speech, but you certainly understand them. You notice names of people and places as they come up. And you can tell whether a sentence or paragraph is happy or angry or sad.
This kind of analysis is difficult in any programming language, not least because human language can be so rich and subtle that computer languages cannot capture anywhere near the total amount of information “encoded” in it.
R does have good libraries for natural language processing. Because R is able to interface with other languages like C, C++, and Java, it is possible to use libraries written in those lower-level and hence faster languages, while writing your code in R and taking advantage of its functional programming style and its many other libraries for data analysis. Indeed, most of the techniques that are of most use here, such as word and sentence tokenization (the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens), n-gram creation, and named entity recognition, are easily performed in R.
There are several R packages which make natural language processing possible in R. The NLP package provides a set of classes and functions for NLP which are used widely by other packages in R. The openNLP package provides an interface to the Apache OpenNLP library, which is written in Java. RWeka provides an R interface to the Weka data mining software, also written in Java. RWeka is especially useful for creating n-grams.
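As a small illustration of what these packages do, the sketch below uses the RWeka NGramTokenizer to split one example sentence into bigrams. This is only a toy example with a made-up sentence, not part of the later analysis.
library(RWeka)
# Split one example sentence into bigrams (pairs of consecutive words)
sentence <- "typing on mobile devices can be a serious pain"
NGramTokenizer(sentence, Weka_control(min = 2, max = 2))
# returns "typing on" "on mobile" "mobile devices" ...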
Common steps in natural language processing
- Lexical Analysis - identifying and analyzing the structure of words. The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words (see the sketch after this list).
- Syntactic Analysis (Parsing) - analyzing the words in a sentence for grammar and arranging them in a manner that shows the relationships among them. A sentence such as "The school goes to boy" is rejected by an English syntactic analyzer.
- Semantic Analysis - drawing the exact or dictionary meaning from the text. The text is checked for meaningfulness by mapping syntactic structures onto objects in the task domain. The semantic analyzer disregards sentences such as "hot ice-cream".
- Discourse Integration - the meaning of any sentence depends upon the meaning of the sentence just before it, and it also influences the meaning of the sentence that immediately follows.
- Pragmatic Analysis - what was said is re-interpreted as what was actually meant. It involves deriving those aspects of language which require real-world knowledge.
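To make the first of these steps, lexical analysis, a bit more concrete, the sketch below breaks a small piece of text into sentences and words with the stringi package (loaded later in this report). It is only an illustration of the idea, not part of the model.
library(stringi)
text <- "But typing on mobile devices can be a serious pain. SwiftKey builds a smart keyboard."
# Lexical analysis: split the text into sentences, then into words
stri_split_boundaries(text, type = "sentence")
stri_extract_all_words(text)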
Loading the necessary libraries
library(NLP)
library(tm)
library(RWeka)
library(SnowballC)
library(stringi)
library(RColorBrewer)
library(wordcloud)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(slam)
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
Download, unzip and save the file in the working directory
Before I started I set the working directory and checked the setting. The files for the Capstone project were downloaded as indicated by Coursera, unzipped and stored in the working directory.
setwd("/Users/anknape/Mainfolder/Study/Capstone")
SKData <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
basename(SKData) # basename gives the name of the file
## [1] "Coursera-SwiftKey.zip"
download.file(SKData, basename(SKData)) # download the files
SwiftKey <- unzip("Coursera-SwiftKey.zip") # unzip and place the files in the working directory
unzip("Coursera-SwiftKey.zip", list = TRUE) # unzip and show the files
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
setwd("/Users/anknape/Mainfolder/Study/Capstone/final")
list.files()
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
A first look at the data
A directory called “final” is created with four subdirectories called de_DE, ru_RU, en_US and fi_FI. A first look at the data shows enormous files, and hopefully the analytics can be done on a subset of the data, as otherwise I would need a heavier Mac to complete the work. The en_US blog, twitter and news files together already have a size of 583,077,241 bytes.
Data acquisition and cleaning
- Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it (a first sketch follows right after this list).
- Profanity filtering - removing profanity and other words you do not want to predict.
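As a first, rough sketch of these two tasks, the function below reads a file and returns a vector of lowercase word tokens, and a second helper drops every token that appears in a profanity list. The file name and the badwords vector are only placeholders; the actual tokenization and profanity filtering in this report are done later with the tm package.
library(stringi)
# Sketch: read a file and return a vector of lowercase word tokens
tokenizeFile <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(stri_extract_all_words(stri_trans_tolower(lines)))
  words[!is.na(words)]
}
# Sketch: drop tokens that appear in a list of profane words (placeholder list)
removeProfanity <- function(tokens, badwords) tokens[!tokens %in% badwords]
# tokens <- removeProfanity(tokenizeFile("en_US.blogs.txt"), badwords = c("badword1", "badword2"))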
Of the English, German, Russian and Finnish data sets I will use the English one. I will get familiar with the set and do the necessary cleaning. First I set the directory where the files were downloaded, final/en_US, and got some idea about the files included.
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US")
file.info("en_US.blogs.txt")
## size isdir mode mtime
## en_US.blogs.txt 210160014 FALSE 644 2015-12-29 17:50:45
## ctime atime uid gid uname
## en_US.blogs.txt 2015-12-29 17:50:45 2015-12-29 17:50:45 503 20 anknape
## grname
## en_US.blogs.txt staff
BL = readLines("en_US.blogs.txt")
summary(BL)
## Length Class Mode
## 899288 character character
file.info("en_US.twitter.txt")
## size isdir mode mtime
## en_US.twitter.txt 167105338 FALSE 644 2015-12-29 17:50:42
## ctime atime uid gid uname
## en_US.twitter.txt 2015-12-29 17:50:42 2015-12-29 17:50:42 503 20 anknape
## grname
## en_US.twitter.txt staff
TW = readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
summary(TW)
## Length Class Mode
## 2360148 character character
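The warnings about embedded nul characters can be avoided by telling readLines to skip them; the alternative call below (not used for the numbers reported here) reads the file without the warnings.
TW_clean <- readLines("en_US.twitter.txt", skipNul = TRUE) # skip the embedded nul characters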
file.info("en_US.news.txt")
## size isdir mode mtime
## en_US.news.txt 205811889 FALSE 644 2015-12-29 17:50:44
## ctime atime uid gid uname
## en_US.news.txt 2015-12-29 17:50:44 2015-12-29 17:50:43 503 20 anknape
## grname
## en_US.news.txt staff
NW = readLines("en_US.news.txt")
summary(NW)
## Length Class Mode
## 1010242 character character
A second look at the data
The first impression of the data showed (too) big files, and after some first work it became clear how much my computer could handle. I therefore created new, smaller files containing the same percentage (about 1%) of the lines of each of the three files: news 10,102 of 1,010,242 lines, blogs 8,992 of 899,288 lines, and twitter 23,601 of 2,360,148 lines.
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US")
txtBL <- readLines("en_US.blogs.txt", 8992, encoding = "UTF-8")
txtTW <- readLines("en_US.twitter.txt", 23601, encoding = "UTF-8")
txtNW <- readLines("en_US.news.txt", 10102, encoding = "UTF-8")
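Note that readLines with a count only takes the first lines of each file. A random sample of roughly the same size might represent the files better; the sketch below shows one way to draw such a sample (I did not use it for the figures in this report).
set.seed(1234)
# Sketch: draw a random sample of about 1% of the lines instead of the first 1%
sampleLines <- function(path, fraction = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, fraction) == 1]
}
# txtBL <- sampleLines("en_US.blogs.txt")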
Create a directory and write the files to it
To get a Corpus I created a directory (en_US_small), wrote the files en_US.blogs, en_US.twitter and en_US.news to it, and set the working directory.
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US")
dir.create("en_US_small")
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US/en_US_small")
writeLines(txtBL, con = "en_US.blogs", sep = " ", useBytes = FALSE)
writeLines(txtTW, con = "en_US.twitter", sep = " ", useBytes = FALSE)
writeLines(txtNW, con = "en_US.news", sep = " ", useBytes = FALSE)
Create a basic report of summary statistics about the data sets
In order to get familiar with the data I did some exploratory analysis by looking at:
- the first four lines of the blog, twitter and news text,
- the number of lines of the blog, twitter and news text,
- the number of words in the blog, twitter and news text.
head(txtBL, 4)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
head(txtTW, 4)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
head(txtNW, 4)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
LBL <- length(txtBL)
LBL
## [1] 8992
LTW <- length(txtTW)
LTW
## [1] 23601
LNW <- length(txtNW)
LNW
## [1] 10102
WBL <- stri_count_words(txtBL)
WTW <- stri_count_words(txtTW)
WNW <- stri_count_words(txtNW)
SWBL <- sum(WBL)
SWTW <- sum(WTW)
SWNW <- sum(WNW)
Overview <- data.frame(Text = c("blog", "twitter", "news"),
Lines = c(LBL, LTW, LNW),
Words = c(SWBL, SWTW, SWNW))
print(Overview)
## Text Lines Words
## 1 blog 8992 370730
## 2 twitter 23601 300257
## 3 news 10102 351572
Create a Corpus of the blog, twitter and news text
I had to create the main structure for managing documents, called a Corpus, which represents a collection of text documents held fully in memory. After doing so I checked the metadata of the individual documents in the Corpus.
sampleCorpus <- Corpus(DirSource("/Users/anknape/Mainfolder/Study/Capstone/final/en_US/en_US_small"), readerControl = list(language = "en_US"))
sampleCorpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2052980
sampleCorpus[[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2066016
sampleCorpus[[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1640470
Clean the Corpus of numbers, punctuation, whitespace and profane words, and remove English stopwords
After I had a global overview of the data I had to identify appropriate tokens such as words, punctuation, and numbers, and work on the data in order to get a tokenized version of it.
My first goal was to get rid of the profane words, and the first problem as a non-native English speaker is: what are these? On GitHub I found a list of profane words, compiled by Google, which I used (https://gist.github.com/ryanlewis/a37739d710ccdb4b406d). I first created a directory for the list, downloaded it and read in the CSV file. I then cleaned the Corpus by removing capital letters, numbers, punctuation, extra whitespace and profane words, and finally removed the English stopwords.
I also searched for tm methods to clear the text of times and dates, but unfortunately I could not find anything on this subject.
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US")
dir.create("Profanity")
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US/Profanity")
ProfanityFile <- "https://gist.githubusercontent.com/ryanlewis/a37739d710ccdb4b406d/raw/0fbd315eb2900bb736609ea894b9bde8217b991a/google_twunter_lol"
download.file(ProfanityFile, destfile = "profanity.CSV", method = "curl")
profanity <- read.csv("./profanity.CSV", header = FALSE, stringsAsFactors = FALSE)
profanity <- profanity$V1
sampleCorpus <- tm_map(sampleCorpus, content_transformer(tolower))      # remove capital letters
sampleCorpus <- tm_map(sampleCorpus, removeNumbers)                     # remove numbers
sampleCorpus <- tm_map(sampleCorpus, removePunctuation)                 # remove punctuation
sampleCorpus <- tm_map(sampleCorpus, stripWhitespace)                   # collapse extra whitespace
sampleCorpus <- tm_map(sampleCorpus, removeWords, profanity)            # remove profane words
sampleCorpus <- tm_map(sampleCorpus, removeWords, stopwords("english")) # remove English stopwords
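To show what these cleaning steps actually do, the sketch below applies the same tm helper functions to one made-up sentence (the sentence is only an example and is not taken from the corpus).
example <- "On 15 May 2015, the BEST day ever, we met 3 friends!!"
example <- tolower(example)                            # remove capital letters
example <- removeNumbers(example)                      # remove numbers
example <- removePunctuation(example)                  # remove punctuation
example <- removeWords(example, stopwords("english"))  # remove English stopwords
example <- stripWhitespace(example)                    # collapse extra whitespace
example
# roughly: " may best day ever met friends"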
Exploratory analysis
When building a predictive model it is crucial to understand the text and the distribution of, and relationship between, the words, tokens, and phrases in it.
- An exploratory analysis gives an understanding of the distribution of words and the relationships between words in the corpora.
- It also shows the frequencies of words and word pairs.
I created a term-document matrix from the corpus, since a huge number of R functions can be applied to such a matrix. I created both a version with terms as rows (TDM) and a version with documents as rows (DTM).
TDM <- TermDocumentMatrix(sampleCorpus)
TDM
## <<TermDocumentMatrix (terms: 56729, documents: 3)>>
## Non-/sparse entries: 85202/84985
## Sparsity : 50%
## Maximal term length: 95
## Weighting : term frequency (tf)
DTM <- DocumentTermMatrix(sampleCorpus)
DTM
## <<DocumentTermMatrix (documents: 3, terms: 56729)>>
## Non-/sparse entries: 85202/84985
## Sparsity : 50%
## Maximal term length: 95
## Weighting : term frequency (tf)
Wordcloud: a first detailed impression
To get a first impression of the dataset I used the wordcloud function and selected the 20 words that occurred most.
wordcloud(sampleCorpus, scale = c(5, 0.75), max.words = 20, random.order = FALSE, rot.per = 0.1, use.r.layout = FALSE, colors = brewer.pal(5, "YlOrRd"))
Frequency: which words are found most
Sure, it gives some impression, but it is more the kind of funny overview used in marketing than a source of real insight. I wanted to see that some words are more frequent than others with SMART information, like: what are the distributions of word frequencies? I started out with which, and how many, words had a frequency of at least 1000, 2000 and 3000, and finally which word occurred most.
findFreqTerms(TDM, 1000)
## [1] "also" "back" "can" "day" "dont" "even" "first"
## [8] "get" "going" "good" "got" "great" "just" "know"
## [15] "last" "like" "love" "make" "much" "new" "now"
## [22] "one" "people" "really" "right" "said" "see" "still"
## [29] "think" "time" "today" "two" "want" "way" "well"
## [36] "will" "work" "year"
findFreqTerms(TDM, 2000)
## [1] "can" "get" "just" "like" "one" "said" "time" "will"
findFreqTerms(TDM, 3000)
## [1] "just" "said" "will"
findFreqTerms(TDM, 3100)
## [1] "will"
Top 20 words that were found most in the Corpus
The word “will” had the highest frequency in the whole corpus. But that did not show me, in decreasing order, which words were found most in the Corpus, so I used a different method.
freq <- sort(colSums(as.matrix(DTM)), decreasing = TRUE)
FreqOverview <- data.frame(word = names(freq), freq = freq)
row.names(FreqOverview) <- NULL
head(FreqOverview, 20)
## word freq
## 1 will 3191
## 2 just 3066
## 3 said 3032
## 4 one 2870
## 5 like 2685
## 6 can 2531
## 7 get 2212
## 8 time 2169
## 9 new 1932
## 10 good 1839
## 11 now 1820
## 12 dont 1768
## 13 know 1685
## 14 day 1661
## 15 love 1581
## 16 people 1535
## 17 back 1411
## 18 see 1358
## 19 first 1334
## 20 going 1319
Correlation shows the words related to the most frequent word “will”
This gave a nice overview of the words that were found most in the Corpus, with again the word “will” in first place. Now I would like to understand the relationship (association) between the word “will” and the other words in the corpora. I used a correlation of 0.99 to see which words are related.
findAssocs(DTM, "will", 0.99)
## $will
## abandon absent abundance accompany addition
## 1.00 1.00 1.00 1.00 1.00
## adult advanced affecting agenda alcohol
## 1.00 1.00 1.00 1.00 1.00
## alqaeda altogether amusing ann anna
## 1.00 1.00 1.00 1.00 1.00
## answered apartment appearance appearing appetite
## 1.00 1.00 1.00 1.00 1.00
## arguably aroma array attributes automobile
## 1.00 1.00 1.00 1.00 1.00
## aware based battered begin beloved
## 1.00 1.00 1.00 1.00 1.00
## beside beyond blossom bombers breweries
## 1.00 1.00 1.00 1.00 1.00
## brief burst businessman buying cambridge
## 1.00 1.00 1.00 1.00 1.00
## captured casual celebration chances chapters
## 1.00 1.00 1.00 1.00 1.00
## chase child childhood choice choices
## 1.00 1.00 1.00 1.00 1.00
## cilantro circles clarke clash classical
## 1.00 1.00 1.00 1.00 1.00
## coconut colorful combat combination comments
## 1.00 1.00 1.00 1.00 1.00
## concerned concluded condemned considerable considerations
## 1.00 1.00 1.00 1.00 1.00
## considered considering consumption contemporary contents
## 1.00 1.00 1.00 1.00 1.00
## continuing conventional conversations convinced cope
## 1.00 1.00 1.00 1.00 1.00
## corporation crafts creamy created crisis
## 1.00 1.00 1.00 1.00 1.00
## crust daily dangers dealing declared
## 1.00 1.00 1.00 1.00 1.00
## deeply defiance deliver demonstrate designed
## 1.00 1.00 1.00 1.00 1.00
## detail developed devoted dilemma disclosure
## 1.00 1.00 1.00 1.00 1.00
## display distressed dogs dragging drawing
## 1.00 1.00 1.00 1.00 1.00
## drawn dubbed dutch earn egg
## 1.00 1.00 1.00 1.00 1.00
## embrace emphasis empowered end ended
## 1.00 1.00 1.00 1.00 1.00
## engaged enhance entirely envelope essays
## 1.00 1.00 1.00 1.00 1.00
## examined example exclusively existed experienced
## 1.00 1.00 1.00 1.00 1.00
## explained explicitly faces fairness false
## 1.00 1.00 1.00 1.00 1.00
## familiar family farming fearful feast
## 1.00 1.00 1.00 1.00 1.00
## featured fill flows fond formed
## 1.00 1.00 1.00 1.00 1.00
## forming found foundations founded frequently
## 1.00 1.00 1.00 1.00 1.00
## fresh fur garden garnish gathering
## 1.00 1.00 1.00 1.00 1.00
## gave generation given gold gordon
## 1.00 1.00 1.00 1.00 1.00
## gotten grant granted green grocery
## 1.00 1.00 1.00 1.00 1.00
## guide hed hefty holding homage
## 1.00 1.00 1.00 1.00 1.00
## hooked hospitality ideals implemented inevitable
## 1.00 1.00 1.00 1.00 1.00
## initial inner insistence installation instance
## 1.00 1.00 1.00 1.00 1.00
## intensity introduced inventory investing journalist
## 1.00 1.00 1.00 1.00 1.00
## jumping kim known lasting leaping
## 1.00 1.00 1.00 1.00 1.00
## lightly lingering list listed loaded
## 1.00 1.00 1.00 1.00 1.00
## location loosely magazines main maintain
## 1.00 1.00 1.00 1.00 1.00
## marking meantime medicine medium minors
## 1.00 1.00 1.00 1.00 1.00
## mode monitor movements muscle navigate
## 1.00 1.00 1.00 1.00 1.00
## noisy none observations obstacles obtain
## 1.00 1.00 1.00 1.00 1.00
## occasionally often older oldest onions
## 1.00 1.00 1.00 1.00 1.00
## opposition outdoors oxygen paired palestinian
## 1.00 1.00 1.00 1.00 1.00
## passage passions paths penny peter
## 1.00 1.00 1.00 1.00 1.00
## phenomenon pieces places politically portrait
## 1.00 1.00 1.00 1.00 1.00
## precisely preferably preferred prepared priests
## 1.00 1.00 1.00 1.00 1.00
## prior progress projects prompting prophets
## 1.00 1.00 1.00 1.00 1.00
## prospect protective proudly pull purchase
## 1.00 1.00 1.00 1.00 1.00
## pushed quickly quietly range rare
## 1.00 1.00 1.00 1.00 1.00
## realities rear reasonably relationships relied
## 1.00 1.00 1.00 1.00 1.00
## relief religious removing resemble review
## 1.00 1.00 1.00 1.00 1.00
## rise rounded routine rows royal
## 1.00 1.00 1.00 1.00 1.00
## sample sandy scene scheme scientific
## 1.00 1.00 1.00 1.00 1.00
## seating seed sees september sets
## 1.00 1.00 1.00 1.00 1.00
## sexual shared ships shorter showing
## 1.00 1.00 1.00 1.00 1.00
## side silver similar situation size
## 1.00 1.00 1.00 1.00 1.00
## skill slightly small socalled sofa
## 1.00 1.00 1.00 1.00 1.00
## soldiers sons spectrum spent spring
## 1.00 1.00 1.00 1.00 1.00
## stated statements steps stood stopped
## 1.00 1.00 1.00 1.00 1.00
## store stroke structure stunned subjected
## 1.00 1.00 1.00 1.00 1.00
## suffering suffers sunset sympathetic tablets
## 1.00 1.00 1.00 1.00 1.00
## tank teaspoons temperatures tends themes
## 1.00 1.00 1.00 1.00 1.00
## thin threatening took trailer translated
## 1.00 1.00 1.00 1.00 1.00
## travel treated trials tries trim
## 1.00 1.00 1.00 1.00 1.00
## tying unknown urban value values
## 1.00 1.00 1.00 1.00 1.00
## vegetable versions vines visible visited
## 1.00 1.00 1.00 1.00 1.00
## void walls weighing werent white
## 1.00 1.00 1.00 1.00 1.00
## wide widening witness woman yard
## 1.00 1.00 1.00 1.00 1.00
## abandonment accurate accuse acknowledge action
## 0.99 0.99 0.99 0.99 0.99
## admittedly aloud amounts anchor anyones
## 0.99 0.99 0.99 0.99 0.99
## appliances appointment appropriate arabia argue
## 0.99 0.99 0.99 0.99 0.99
## arise ark armed asia assure
## 0.99 0.99 0.99 0.99 0.99
## attacking attend awoke backwards barrage
## 0.99 0.99 0.99 0.99 0.99
## barrier basis beginning behaviors bernie
## 0.99 0.99 0.99 0.99 0.99
## bids bigger bonnie boundaries branches
## 0.99 0.99 0.99 0.99 0.99
## breakdown breakfasts breast brightly brought
## 0.99 0.99 0.99 0.99 0.99
## bubble butter came carrot catalyst
## 0.99 0.99 0.99 0.99 0.99
## caught causing changed cheaper chewy
## 0.99 0.99 0.99 0.99 0.99
## cocoa collections colonel combines comfy
## 0.99 0.99 0.99 0.99 0.99
## commissions complexity comprehension considerably considers
## 0.99 0.99 0.99 0.99 0.99
## constraints control convenient cooperative corn
## 0.99 0.99 0.99 0.99 0.99
## credible criminals critically cropped cruiser
## 0.99 0.99 0.99 0.99 0.99
## daughters davies debuts declare degree
## 0.99 0.99 0.99 0.99 0.99
## denis dessert dimension discouraging diving
## 0.99 0.99 0.99 0.99 0.99
## doors dressed dunes echoing economically
## 0.99 0.99 0.99 0.99 0.99
## edgy effects eldest elk emotion
## 0.99 0.99 0.99 0.99 0.99
## emphasizes encourage enforced entity envelopes
## 0.99 0.99 0.99 0.99 0.99
## exceptions explanation extensive fate fats
## 0.99 0.99 0.99 0.99 0.99
## fishermen fits flair flashed floor
## 0.99 0.99 0.99 0.99 0.99
## flourish fluffy footing forced fragile
## 0.99 0.99 0.99 0.99 0.99
## framed freed fries frightening frustration
## 0.99 0.99 0.99 0.99 0.99
## fulfilling fully generic grapefruit greek
## 0.99 0.99 0.99 0.99 0.99
## guarantees guards hallucinations heavy helps
## 0.99 0.99 0.99 0.99 0.99
## highspeed hobby hollow honor hosts
## 0.99 0.99 0.99 0.99 0.99
## hunted illness imf incorporated independently
## 0.99 0.99 0.99 0.99 0.99
## infestation influence infringement initiated interests
## 0.99 0.99 0.99 0.99 0.99
## interpret interpretation jarring jerky joshua
## 0.99 0.99 0.99 0.99 0.99
## karzai latter laurie least lebanon
## 0.99 0.99 0.99 0.99 0.99
## lend letters leveled lifted lined
## 0.99 0.99 0.99 0.99 0.99
## linked liquid luxurious mansfield many
## 0.99 0.99 0.99 0.99 0.99
## may middleaged milieu missiles mistrust
## 0.99 0.99 0.99 0.99 0.99
## modified montreal moods motives mountains
## 0.99 0.99 0.99 0.99 0.99
## move musician naive names narrowed
## 0.99 0.99 0.99 0.99 0.99
## nash natalie nation natural nervous
## 0.99 0.99 0.99 0.99 0.99
## numbered obligations occupancy occurrence october
## 0.99 0.99 0.99 0.99 0.99
## outsourcing outweigh painted palate pastry
## 0.99 0.99 0.99 0.99 0.99
## pear performances permanently physically platter
## 0.99 0.99 0.99 0.99 0.99
## politics pope preached present productivity
## 0.99 0.99 0.99 0.99 0.99
## professors projector proves purchased push
## 0.99 0.99 0.99 0.99 0.99
## queen rack raisins reacted rectangular
## 0.99 0.99 0.99 0.99 0.99
## redevelopment reinforced relatively relies remark
## 0.99 0.99 0.99 0.99 0.99
## remarks rendered resembles resistance revered
## 0.99 0.99 0.99 0.99 0.99
## ripples root ropes rubbed sack
## 0.99 0.99 0.99 0.99 0.99
## sacrificed saison saudi search secure
## 0.99 0.99 0.99 0.99 0.99
## sequence setup shallow shelved shepherd
## 0.99 0.99 0.99 0.99 0.99
## sheridan shields shocking shrugged siblings
## 0.99 0.99 0.99 0.99 0.99
## soften softened split sponge stages
## 0.99 0.99 0.99 0.99 0.99
## sting stirred stored strengths strewn
## 0.99 0.99 0.99 0.99 0.99
## stripes stroller subjects subsequently suggests
## 0.99 0.99 0.99 0.99 0.99
## suited supermarket survivor tackling tactic
## 0.99 0.99 0.99 0.99 0.99
## tail tangy target targeting testify
## 0.99 0.99 0.99 0.99 0.99
## texture thighs towering traced tracks
## 0.99 0.99 0.99 0.99 0.99
## traditionally traits transport triumph turmoil
## 0.99 0.99 0.99 0.99 0.99
## twists umbrella unacceptable uncles uncomfortable
## 0.99 0.99 0.99 0.99 0.99
## uses viewpoint visually waiter warming
## 0.99 0.99 0.99 0.99 0.99
## warn waters welfare withdrawals within
## 0.99 0.99 0.99 0.99 0.99
## wreckage
## 0.99
Create a function extracting n-grams out of the Corpus and plot the details
It is interesting to see which words are related, as I did not expect these words to have a correlation of 0.99 with the word “will”; this gives me good insight into the Corpus.
The wordcloud gave me “some” information about the words that occurred most, but with the ggplot2 package a better visualization can be obtained. I first wrote a function that computes n-grams for each row of text data in R. After some experiments of my own I found the function on Stack Overflow: http://stackoverflow.com/questions/17556085/compute-ngrams-for-each-row-of-text-data-in-r
This code produces a bar chart of the twenty single words found most in the Corpus.
# Tokenizer that splits each document into single words (1-grams)
OneT <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
OneGram <- TermDocumentMatrix(sampleCorpus, control = list(tokenize = OneT))
OneGram <- as.matrix(rollup(OneGram, 2, na.rm = TRUE, FUN = sum))
OneGram <- data.frame(word = rownames(OneGram), freq = OneGram[, 1])
OneGram <- OneGram[order(-OneGram$freq), ][1:20, ]
OneGram$word <- factor(OneGram$word, as.character(OneGram$word))
p <- ggplot(OneGram, aes(x = word, y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "red") +
  ggtitle("Count of the single words that are found most") +
  ylab("Frequency found in Corpus (times)") +
  xlab("Word found most in Corpus") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
Let’s move on to the BiGram
The plot gave a good overview of the single words that were found most in the Corpus and of how their frequencies are distributed. I did the same thing for the twenty most frequent combinations of two consecutive words.
# Tokenizer that splits each document into pairs of consecutive words (2-grams)
BiT <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
BiGram <- TermDocumentMatrix(sampleCorpus, control = list(tokenize = BiT))
BiGram <- as.matrix(rollup(BiGram, 2, na.rm = TRUE, FUN = sum))
BiGram <- data.frame(word = rownames(BiGram), freq = BiGram[, 1])
BiGram <- BiGram[order(-BiGram$freq), ][1:20, ]
BiGram$word <- factor(BiGram$word, as.character(BiGram$word))
p <- ggplot(BiGram, aes(x = word, y = freq)) +
  geom_bar(stat = "identity", fill = "red", color = "steelblue") +
  ggtitle("Count of the two-word combinations that were found most") +
  ylab("Frequency found in Corpus (times)") +
  xlab("Words found most in Corpus") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
Let’s get closer to the prediction model with the TriGram
I did the same thing for the twenty most frequent combinations of three consecutive words.
# Tokenizer that splits each document into runs of three consecutive words (3-grams)
TriT <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
TriGram <- TermDocumentMatrix(sampleCorpus, control = list(tokenize = TriT))
TriGram <- as.matrix(rollup(TriGram, 2, na.rm = TRUE, FUN = sum))
TriGram <- data.frame(word = rownames(TriGram), freq = TriGram[, 1])
TriGram <- TriGram[order(-TriGram$freq), ][1:20, ]
TriGram$word <- factor(TriGram$word, as.character(TriGram$word))
p <- ggplot(TriGram, aes(x = word, y = freq)) +
  geom_bar(stat = "identity", fill = "yellow", color = "black") +
  ggtitle("Count of the three-word combinations that were found most") +
  ylab("Frequency found in Corpus (times)") +
  xlab("Words found most in Corpus") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
Finally, the four-word combinations found most in the Corpus
I did the same thing for the twenty most frequent combinations of four consecutive words.
# Tokenizer that splits each document into runs of four consecutive words (4-grams)
FourT <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
FrGram <- TermDocumentMatrix(sampleCorpus, control = list(tokenize = FourT))
FrGram <- as.matrix(rollup(FrGram, 2, na.rm = TRUE, FUN = sum))
FrGram <- data.frame(word = rownames(FrGram), freq = FrGram[, 1])
FrGram <- FrGram[order(-FrGram$freq), ][1:20, ]
FrGram$word <- factor(FrGram$word, as.character(FrGram$word))
p <- ggplot(FrGram, aes(x = word, y = freq)) +
  geom_bar(stat = "identity", fill = "yellow", color = "black") +
  ggtitle("Count of the four-word combinations that were found most") +
  ylab("Frequency found in Corpus (times)") +
  xlab("Words found most in Corpus") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
Conclusion
My objective for this project is to build a predictive text model that presents three options for what the next word might be. So far I have concluded that the files provided are huge and that the training set I used is small, so that my computer could handle it. This can cause some noise in the predicted words.
I need to clean the Corpus better, as I saw a lot of “dirt” in the plots, and I will also make the Corpus smaller to get better predictions.
I will also study the Markov chain, a stochastic process (in discrete time, a DTMC). The term “Markov chain” refers to the sequence of random words such a process moves through, with the Markov property defining the serial dependence between words (as in a “chain”). It can thus be used for predicting the next word through a chain of linked words.
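As a first, rough sketch of this idea, the code below counts the bigrams in the cleaned sample Corpus and uses them as a simple Markov-style lookup table: given the last typed word, it suggests the three most frequent words that followed it in the Corpus. The function name predictNext is my own, and the final model will be more refined than this.
# Count the bigrams in the cleaned sample Corpus (same tokenizer as above)
BiT <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
BiTDM <- TermDocumentMatrix(sampleCorpus, control = list(tokenize = BiT))
bifreq <- sort(rowSums(as.matrix(BiTDM)), decreasing = TRUE)
# Split every bigram into the current word and the word that follows it
parts <- strsplit(names(bifreq), " ")
bigrams <- data.frame(first = sapply(parts, `[`, 1),
                      second = sapply(parts, `[`, 2),
                      freq = bifreq,
                      stringsAsFactors = FALSE)
# Suggest the three most frequent followers of a typed word (Markov-chain style)
predictNext <- function(word, n = 3) {
  head(bigrams$second[bigrams$first == tolower(word)], n)
}
predictNext("new") # might return for example "york" "year" ...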