Synopsis
In this project we build a predictive text model that presents three options for what the next word might be when someone types a sentence on a keyboard.
Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. The company SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is its predictive text model. When someone types a sentence, the keyboard presents three options for what the next word might be.
In this project I will analyze a large corpus of text documents to discover the structure in the data and how words are put together. I will clean and analyze the text data, then build and sample from a predictive text model. Finally, I will build a predictive text product. This RMarkdown report is written so that non data scientists can understand it as well.
The first step in analyzing any new data set is figuring out:
- what data you have
- what are the standard tools and models used for that type of data.
The data is downloaded from Coursera in files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data comes from a corpus called HC Corpora (www.corpora.heliohost.org). The files have been language filtered but may still contain some foreign text. Besides the downloaded files there are other data sources that might help with this project. The quality of the text prediction in this project is limited to the text I just downloaded. To get more high-quality text you could also think of using an English dictionary as an extra source: not only the single words would add quality, but also the example sentences in which each word is used.
Tasks to accomplish
- Obtaining the data - Can you download the data and load/manipulate it in R?
- Familiarizing yourself with NLP and text mining - Learn about the basics of natural language processing and how it relates to the data science process you have learned in the Data Science Specialization.
Familiarizing yourself with NLP and text mining
Natural language processing (NLP) is a set of techniques for using computers to detect in human language the kinds of things that humans detect automatically. For example, when you read a text, you parse it into paragraphs and sentences. You do not explicitly label the parts of speech, but you certainly understand them. You notice names of people and places as they come up. And you can tell whether a sentence or paragraph is happy or angry or sad.
This kind of analysis is difficult in any programming language, not least because human language can be so rich and subtle that computer languages cannot capture anywhere near the total amount of information “encoded” in it.
R does have good libraries for natural language processing. Because R is able to interface with other languages like C, C++, and Java, it is possible to use libraries written in those lower-level and hence faster languages, while writing your code in R and taking advantage of its functional programming style and its many other libraries for data analysis. Indeed, most of the techniques that are of most use here, such as word and sentence tokenization (the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens), n-gram creation, and named entity recognition, are easily performed in R.
There are several R packages which make natural language processing possible in R. The NLP package provides a set of classes and functions for NLP which are used widely by other packages in R. The openNLP package provides an interface to the Apache OpenNLP library, which is written in Java. RWeka provides an R interface to the Weka data mining software, also written in Java. RWeka is especially useful for creating n-grams.
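As a small illustration of what these packages do, the sketch below uses the RWeka NGramTokenizer to split one example sentence into bigrams. This is only a toy example with a made-up sentence, not part of the later analysis.
library(RWeka)
# Split one example sentence into bigrams (pairs of consecutive words)
sentence <- "typing on mobile devices can be a serious pain"
NGramTokenizer(sentence, Weka_control(min = 2, max = 2))
# returns "typing on" "on mobile" "mobile devices" ...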
Common steps in natural language processing
- Lexical Analysis - identifying and analyzing the structure of words. The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words (see the sketch after this list).
- Syntactic Analysis (Parsing) - analyzing the words in a sentence for grammar and arranging them in a manner that shows the relationships among them. A sentence such as "The school goes to boy" is rejected by an English syntactic analyzer.
- Semantic Analysis - drawing the exact or dictionary meaning from the text. The text is checked for meaningfulness by mapping syntactic structures onto objects in the task domain. The semantic analyzer disregards sentences such as "hot ice-cream".
- Discourse Integration - the meaning of any sentence depends upon the meaning of the sentence just before it, and it also influences the meaning of the sentence that immediately follows.
- Pragmatic Analysis - what was said is re-interpreted as what was actually meant. It involves deriving those aspects of language which require real-world knowledge.
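To make the first of these steps, lexical analysis, a bit more concrete, the sketch below breaks a small piece of text into sentences and words with the stringi package (loaded later in this report). It is only an illustration of the idea, not part of the model.
library(stringi)
text <- "But typing on mobile devices can be a serious pain. SwiftKey builds a smart keyboard."
# Lexical analysis: split the text into sentences, then into words
stri_split_boundaries(text, type = "sentence")
stri_extract_all_words(text)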
Loading the necessary libraries
library(NLP)
library(tm)
library(RWeka)
library(SnowballC)
library(stringi)
library(RColorBrewer)
library(wordcloud)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(slam)
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
Download, unzip and save the file in the working directory
Before I started I set the working directory and checked the setting. The files for the Capstone project were downloaded as indicated by Coursera, unzipped and stored in the working directory.
setwd("/Users/anknape/Mainfolder/Study/Capstone")
SKData <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
basename(SKData) # basename gives the name of the file
## [1] "Coursera-SwiftKey.zip"
download.file(SKData, basename(SKData)) # download the files
SwiftKey <- unzip("Coursera-SwiftKey.zip") # unzip and place the files in the working directory
unzip("Coursera-SwiftKey.zip", list = TRUE) # unzip and show the files
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
setwd("/Users/anknape/Mainfolder/Study/Capstone/final")
list.files()
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
A first look at the data
A directory called “final” is created with four subdirectories called de_DE, ru_RU, en_US and fi_FI. A first look at the data shows enormous files, and hopefully the analytics can be done on a subset of the data, as otherwise I would need a heavier Mac to complete the work. The en_US blog, twitter and news files together already have a size of 583,077,241 bytes.
Data acquisition and cleaning
- Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it (a first sketch follows right after this list).
- Profanity filtering - removing profanity and other words you do not want to predict.
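As a first, rough sketch of these two tasks, the function below reads a file and returns a vector of lowercase word tokens, and a second helper drops every token that appears in a profanity list. The file name and the badwords vector are only placeholders; the actual tokenization and profanity filtering in this report are done later with the tm package.
library(stringi)
# Sketch: read a file and return a vector of lowercase word tokens
tokenizeFile <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(stri_extract_all_words(stri_trans_tolower(lines)))
  words[!is.na(words)]
}
# Sketch: drop tokens that appear in a list of profane words (placeholder list)
removeProfanity <- function(tokens, badwords) tokens[!tokens %in% badwords]
# tokens <- removeProfanity(tokenizeFile("en_US.blogs.txt"), badwords = c("badword1", "badword2"))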
Of the English, German, Russian and Finnish data sets I will use the English one. I will get familiar with the set and do the necessary cleaning. First I set the directory where the files were downloaded, final/en_US, and got some idea about the files included.
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US")
file.info("en_US.blogs.txt")
## size isdir mode mtime
## en_US.blogs.txt 210160014 FALSE 644 2015-12-29 17:50:45
## ctime atime uid gid uname
## en_US.blogs.txt 2015-12-29 17:50:45 2015-12-29 17:50:45 503 20 anknape
## grname
## en_US.blogs.txt staff
BL = readLines("en_US.blogs.txt")
summary(BL)
## Length Class Mode
## 899288 character character
file.info("en_US.twitter.txt")
## size isdir mode mtime
## en_US.twitter.txt 167105338 FALSE 644 2015-12-29 17:50:42
## ctime atime uid gid uname
## en_US.twitter.txt 2015-12-29 17:50:42 2015-12-29 17:50:42 503 20 anknape
## grname
## en_US.twitter.txt staff
TW = readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
summary(TW)
## Length Class Mode
## 2360148 character character
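The warnings about embedded nul characters can be avoided by telling readLines to skip them; the alternative call below (not used for the numbers reported here) reads the file without the warnings.
TW_clean <- readLines("en_US.twitter.txt", skipNul = TRUE) # skip the embedded nul characters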
file.info("en_US.news.txt")
## size isdir mode mtime
## en_US.news.txt 205811889 FALSE 644 2015-12-29 17:50:44
## ctime atime uid gid uname
## en_US.news.txt 2015-12-29 17:50:44 2015-12-29 17:50:43 503 20 anknape
## grname
## en_US.news.txt staff
NW = readLines("en_US.news.txt")
summary(NW)
## Length Class Mode
## 1010242 character character
A second look at the data
The first impression of the data showed (too) big files, and after some first work it became clear how much my computer could handle. I therefore created new, smaller files containing the same percentage (about 1%) of the lines of each of the three files: news 10,102 of 1,010,242 lines, blogs 8,992 of 899,288 lines, and twitter 23,601 of 2,360,148 lines.
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US")
txtBL <- readLines("en_US.blogs.txt", 8992, encoding = "UTF-8")
txtTW <- readLines("en_US.twitter.txt", 23601, encoding = "UTF-8")
txtNW <- readLines("en_US.news.txt", 10102, encoding = "UTF-8")
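Note that readLines with a count only takes the first lines of each file. A random sample of roughly the same size might represent the files better; the sketch below shows one way to draw such a sample (I did not use it for the figures in this report).
set.seed(1234)
# Sketch: draw a random sample of about 1% of the lines instead of the first 1%
sampleLines <- function(path, fraction = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, fraction) == 1]
}
# txtBL <- sampleLines("en_US.blogs.txt")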
Create a directory and write the files to it
To get a Corpus I created a directory (en_US_small), wrote the files en_US.blogs, en_US.twitter and en_US.news to it, and set the working directory.
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US")
dir.create("en_US_small")
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US/en_US_small")
writeLines(txtBL, con = "en_US.blogs", sep = " ", useBytes = FALSE)
writeLines(txtTW, con = "en_US.twitter", sep = " ", useBytes = FALSE)
writeLines(txtNW, con = "en_US.news", sep = " ", useBytes = FALSE)
Create a basic report of summary statistics about the data sets
In order to get familiar with the data I did some exploratory analysis by looking at:
- the first four lines of the blog, twitter and news text,
- the number of lines of the blog, twitter and news text,
- the number of words in the blog, twitter and news text.
head(txtBL, 4)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
head(txtTW, 4)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
head(txtNW, 4)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
LBL <- length(txtBL)
LBL
## [1] 8992
LTW <- length(txtTW)
LTW
## [1] 23601
LNW <- length(txtNW)
LNW
## [1] 10102
WBL <- stri_count_words(txtBL)
WTW <- stri_count_words(txtTW)
WNW <- stri_count_words(txtNW)
SWBL <- sum(WBL)
SWTW <- sum(WTW)
SWNW <- sum(WNW)
Overview <- data.frame(Text = c("blog", "twitter", "news"),
Lines = c(LBL, LTW, LNW),
Words = c(SWBL, SWTW, SWNW))
print(Overview)
## Text Lines Words
## 1 blog 8992 370730
## 2 twitter 23601 300257
## 3 news 10102 351572
Create a Corpus of the blog, twitter and news text
I had to create the main structure for managing documents, called a Corpus, which represents a collection of text documents held fully in memory. After doing so I checked the metadata of the individual documents in the Corpus.
sampleCorpus <- Corpus(DirSource("/Users/anknape/Mainfolder/Study/Capstone/final/en_US/en_US_small"), readerControl = list(language = "en_US"))
sampleCorpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2052980
sampleCorpus[[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2066016
sampleCorpus[[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1640470
Clean the Corpus of numbers, punctuation, whitespace and profane words, and remove English stopwords
After I had a global overview of the data I had to identify appropriate tokens such as words, punctuation, and numbers, and work on the data in order to get a tokenized version of it.
My first goal was to get rid of the profane words, and the first problem as a non-native English speaker is: what are these? On GitHub I found a list of profane words, compiled by Google, which I used (https://gist.github.com/ryanlewis/a37739d710ccdb4b406d). I first created a directory for the list, downloaded it and read in the CSV file. I then cleaned the Corpus by removing capital letters, numbers, punctuation, extra whitespace and profane words, and finally removed the English stopwords.
I also searched for tm methods to clear the text of times and dates, but unfortunately I could not find anything on this subject.
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US")
dir.create("Profanity")
setwd("/Users/anknape/Mainfolder/Study/Capstone/final/en_US/Profanity")
ProfanityFile <- "https://gist.githubusercontent.com/ryanlewis/a37739d710ccdb4b406d/raw/0fbd315eb2900bb736609ea894b9bde8217b991a/google_twunter_lol"
download.file(ProfanityFile, destfile = "profanity.CSV", method = "curl")
profanity <- read.csv("./profanity.CSV", header = FALSE, stringsAsFactors = FALSE)
profanity <- profanity$V1
sampleCorpus <- tm_map(sampleCorpus, content_transformer(tolower))      # remove capital letters
sampleCorpus <- tm_map(sampleCorpus, removeNumbers)                     # remove numbers
sampleCorpus <- tm_map(sampleCorpus, removePunctuation)                 # remove punctuation
sampleCorpus <- tm_map(sampleCorpus, stripWhitespace)                   # collapse extra whitespace
sampleCorpus <- tm_map(sampleCorpus, removeWords, profanity)            # remove profane words
sampleCorpus <- tm_map(sampleCorpus, removeWords, stopwords("english")) # remove English stopwords
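To show what these cleaning steps actually do, the sketch below applies the same tm helper functions to one made-up sentence (the sentence is only an example and is not taken from the corpus).
example <- "On 15 May 2015, the BEST day ever, we met 3 friends!!"
example <- tolower(example)                            # remove capital letters
example <- removeNumbers(example)                      # remove numbers
example <- removePunctuation(example)                  # remove punctuation
example <- removeWords(example, stopwords("english"))  # remove English stopwords
example <- stripWhitespace(example)                    # collapse extra whitespace
example
# roughly: " may best day ever met friends"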
Exploratory analysis
When building a predictive model it is crucial to understand the text and the distribution of, and relationship between, the words, tokens, and phrases in it.
- An exploratory analysis gives an understanding of the distribution of words and the relationships between words in the corpora.
- It also shows the frequencies of words and word pairs.
I created a term-document matrix from the corpus, since a huge number of R functions can be applied to such a matrix. I created both a version with terms as rows (TDM) and a version with documents as rows (DTM).
TDM <- TermDocumentMatrix(sampleCorpus)
TDM
## <<TermDocumentMatrix (terms: 56729, documents: 3)>>
## Non-/sparse entries: 85202/84985
## Sparsity : 50%
## Maximal term length: 95
## Weighting : term frequency (tf)
DTM <- DocumentTermMatrix(sampleCorpus)
DTM
## <<DocumentTermMatrix (documents: 3, terms: 56729)>>
## Non-/sparse entries: 85202/84985
## Sparsity : 50%
## Maximal term length: 95
## Weighting : term frequency (tf)
Wordcloud: a first detailed impression
To get a first impression of the dataset I used the wordcloud function and selected the 20 words that occurred most.
wordcloud(sampleCorpus, scale = c(5, 0.75), max.words = 20, random.order = FALSE, rot.per = 0.1, use.r.layout = FALSE, colors = brewer.pal(5, "YlOrRd"))
Frequency: which words are found most
Sure, it gives some impression, but it is more the kind of funny overview used in marketing than a source of real insight. I wanted to see that some words are more frequent than others with SMART information, like: what are the distributions of word frequencies? I started out with which, and how many, words had a frequency of at least 1000, 2000 and 3000, and finally which word occurred most.
findFreqTerms(TDM, 1000)
## [1] "also" "back" "can" "day" "dont" "even" "first"
## [8] "get" "going" "good" "got" "great" "just" "know"
## [15] "last" "like" "love" "make" "much" "new" "now"
## [22] "one" "people" "really" "right" "said" "see" "still"
## [29] "think" "time" "today" "two" "want" "way" "well"
## [36] "will" "work" "year"
findFreqTerms(TDM, 2000)
## [1] "can" "get" "just" "like" "one" "said" "time" "will"
findFreqTerms(TDM, 3000)
## [1] "just" "said" "will"
findFreqTerms(TDM, 3100)
## [1] "will"
Top 20 words that were found most in the Corpus
The word “will” had the highest frequency in the whole corpus. But that did not show me, in decreasing order, which words were found most in the Corpus, so I used a different method.
freq <- sort(colSums(as.matrix(DTM)), decreasing = TRUE)
FreqOverview <- data.frame(word = names(freq), freq = freq)
row.names(FreqOverview) <- NULL
head(FreqOverview, 20)
## word freq
## 1 will 3191
## 2 just 3066
## 3 said 3032
## 4 one 2870
## 5 like 2685
## 6 can 2531
## 7 get 2212
## 8 time 2169
## 9 new 1932
## 10 good 1839
## 11 now 1820
## 12 dont 1768
## 13 know 1685
## 14 day 1661
## 15 love 1581
## 16 people 1535
## 17 back 1411
## 18 see 1358
## 19 first 1334
## 20 going 1319
Correlation shows the words related to the most frequent word “will”
This gave a nice overview of the words that were found most in the Corpus, with again the word “will” in first place. Now I would like to understand the relationship (association) between the word “will” and the other words in the corpora. I used a correlation of 0.99 to see which words are related.
findAssocs(DTM, "will", 0.99)
## $will
## abandon absent abundance accompany addition
## 1.00 1.00 1.00 1.00 1.00
## adult advanced affecting agenda alcohol
## 1.00 1.00 1.00 1.00 1.00
## alqaeda altogether amusing ann anna
## 1.00 1.00 1.00 1.00 1.00
## answered apartment appearance appearing appetite
## 1.00 1.00 1.00 1.00 1.00
## arguably aroma array attributes automobile
## 1.00 1.00 1.00 1.00 1.00
## aware based battered begin beloved
## 1.00 1.00 1.00 1.00 1.00
## beside beyond blossom bombers breweries
## 1.00 1.00 1.00 1.00 1.00
## brief burst businessman buying cambridge
## 1.00 1.00 1.00 1.00 1.00
## captured casual celebration chances chapters
## 1.00 1.00 1.00 1.00 1.00
## chase child childhood choice choices
## 1.00 1.00 1.00 1.00 1.00
## cilantro circles clarke clash classical
## 1.00 1.00 1.00 1.00 1.00
## coconut colorful combat combination comments
## 1.00 1.00 1.00 1.00 1.00
## concerned concluded condemned considerable considerations
## 1.00 1.00 1.00 1.00 1.00
## considered considering consumption contemporary contents
## 1.00 1.00 1.00 1.00 1.00
## continuing conventional conversations convinced cope
## 1.00 1.00 1.00 1.00 1.00
## corporation crafts creamy created crisis
## 1.00 1.00 1.00 1.00 1.00
## crust daily dangers dealing declared
## 1.00 1.00 1.00 1.00 1.00
## deeply defiance deliver demonstrate designed
## 1.00 1.00 1.00 1.00 1.00
## detail developed devoted dilemma disclosure
## 1.00 1.00 1.00 1.00 1.00
## display distressed dogs dragging drawing
## 1.00 1.00 1.00 1.00 1.00
## drawn dubbed dutch earn egg
## 1.00 1.00 1.00 1.00 1.00
## embrace emphasis empowered end ended
## 1.00 1.00 1.00 1.00 1.00
## engaged enhance entirely envelope essays
## 1.00 1.00 1.00 1.00 1.00
## examined example exclusively existed experienced
## 1.00 1.00 1.00 1.00 1.00
## explained explicitly faces fairness false
## 1.00 1.00 1.00 1.00 1.00
## familiar family farming fearful feast
## 1.00 1.00 1.00 1.00 1.00
## featured fill flows fond formed
## 1.00 1.00 1.00 1.00 1.00
## forming found foundations founded frequently
## 1.00 1.00 1.00 1.00 1.00
## fresh fur garden garnish gathering
## 1.00 1.00 1.00 1.00 1.00
## gave generation given gold gordon
## 1.00 1.00 1.00 1.00 1.00
## gotten grant granted green grocery
## 1.00 1.00 1.00 1.00 1.00
## guide hed hefty holding homage
## 1.00 1.00 1.00 1.00 1.00
## hooked hospitality ideals implemented inevitable
## 1.00 1.00 1.00 1.00 1.00
## initial inner insistence installation instance
## 1.00 1.00 1.00 1.00 1.00
## intensity introduced inventory investing journalist
## 1.00 1.00 1.00 1.00 1.00
## jumping kim known lasting leaping
## 1.00 1.00 1.00 1.00 1.00
## lightly lingering list listed loaded
## 1.00 1.00 1.00 1.00 1.00
## location loosely magazines main maintain
## 1.00 1.00 1.00 1.00 1.00
## marking meantime medicine medium minors
## 1.00 1.00 1.00 1.00 1.00
## mode monitor movements muscle navigate
## 1.00 1.00 1.00 1.00 1.00
## noisy none observations obstacles obtain
## 1.00 1.00 1.00 1.00 1.00
## occasionally often older oldest onions
## 1.00 1.00 1.00 1.00 1.00
## opposition outdoors oxygen paired palestinian
## 1.00 1.00 1.00 1.00 1.00
## passage passions paths penny peter
## 1.00 1.00 1.00 1.00 1.00
## phenomenon pieces places politically portrait
## 1.00 1.00 1.00 1.00 1.00
## precisely preferably preferred prepared priests
## 1.00 1.00 1.00 1.00 1.00
## prior progress projects prompting prophets
## 1.00 1.00 1.00 1.00 1.00
## prospect protective proudly pull purchase
## 1.00 1.00 1.00 1.00 1.00
## pushed quickly quietly range rare
## 1.00 1.00 1.00 1.00 1.00
## realities rear reasonably relationships relied
## 1.00 1.00 1.00 1.00 1.00
## relief religious removing resemble review
## 1.00 1.00 1.00 1.00 1.00
## rise rounded routine rows royal
## 1.00 1.00 1.00 1.00 1.00
## sample sandy scene scheme scientific
## 1.00 1.00 1.00 1.00 1.00
## seating seed sees september sets
## 1.00 1.00 1.00 1.00 1.00
## sexual shared ships shorter showing
## 1.00 1.00 1.00 1.00 1.00
## side silver similar situation size
## 1.00 1.00 1.00 1.00 1.00
## skill slightly small socalled sofa
## 1.00 1.00 1.00 1.00 1.00
## soldiers sons spectrum spent spring
## 1.00 1.00 1.00 1.00 1.00
## stated statements steps stood stopped
## 1.00 1.00 1.00 1.00 1.00
## store stroke structure stunned subjected
## 1.00 1.00 1.00 1.00 1.00
## suffering suffers sunset sympathetic tablets
## 1.00 1.00 1.00 1.00 1.00
## tank teaspoons temperatures tends themes
## 1.00 1.00 1.00 1.00 1.00
## thin threatening took trailer translated
## 1.00 1.00 1.00 1.00 1.00
## travel treated trials tries trim
## 1.00 1.00 1.00 1.00 1.00
## tying unknown urban value values
## 1.00 1.00 1.00 1.00 1.00
## vegetable versions vines visible visited
## 1.00 1.00 1.00 1.00 1.00
## void walls weighing werent white
## 1.00 1.00 1.00 1.00 1.00
## wide widening witness woman yard
## 1.00 1.00 1.00 1.00 1.00
## abandonment accurate accuse acknowledge action
## 0.99 0.99 0.99 0.99 0.99
## admittedly aloud amounts anchor anyones
## 0.99 0.99 0.99 0.99 0.99
## appliances appointment appropriate arabia argue
## 0.99 0.99 0.99 0.99 0.99
## arise ark armed asia assure
## 0.99 0.99 0.99 0.99 0.99
## attacking attend awoke backwards barrage
## 0.99 0.99 0.99 0.99 0.99
## barrier basis beginning behaviors bernie
## 0.99 0.99 0.99 0.99 0.99
## bids bigger bonnie boundaries branches
## 0.99 0.99 0.99 0.99 0.99
## breakdown breakfasts breast brightly brought
## 0.99 0.99 0.99 0.99 0.99
## bubble butter came carrot catalyst
## 0.99 0.99 0.99 0.99 0.99
## caught causing changed cheaper chewy
## 0.99 0.99 0.99 0.99 0.99
## cocoa collections colonel combines comfy
## 0.99 0.99 0.99 0.99 0.99
## commissions complexity comprehension considerably considers
## 0.99 0.99 0.99 0.99 0.99
## constraints control convenient cooperative corn
## 0.99 0.99 0.99 0.99 0.99
## credible criminals critically cropped cruiser
## 0.99 0.99 0.99 0.99 0.99
## daughters davies debuts declare degree
## 0.99 0.99 0.99 0.99 0.99
## denis dessert dimension discouraging diving
## 0.99 0.99 0.99 0.99 0.99
## doors dressed dunes echoing economically
## 0.99 0.99 0.99 0.99 0.99
## edgy effects eldest elk emotion
## 0.99 0.99 0.99 0.99 0.99
## emphasizes encourage enforced entity envelopes
## 0.99 0.99 0.99 0.99 0.99
## exceptions explanation extensive fate fats
## 0.99 0.99 0.99 0.99 0.99
## fishermen fits flair flashed floor
## 0.99 0.99 0.99 0.99 0.99
## flourish fluffy footing forced fragile
## 0.99 0.99 0.99 0.99 0.99
## framed freed fries frightening frustration
## 0.99 0.99 0.99 0.99 0.99
## fulfilling fully generic grapefruit greek
## 0.99 0.99 0.99 0.99 0.99
## guarantees guards hallucinations heavy helps
## 0.99 0.99 0.99 0.99 0.99
## highspeed hobby hollow honor hosts
## 0.99 0.99 0.99 0.99 0.99
## hunted illness imf incorporated independently
## 0.99 0.99 0.99 0.99 0.99
## infestation influence infringement initiated interests
## 0.99 0.99 0.99 0.99 0.99
## interpret interpretation jarring jerky joshua
## 0.99 0.99 0.99 0.99 0.99
## karzai latter laurie least lebanon
## 0.99 0.99 0.99 0.99 0.99
## lend letters leveled lifted lined
## 0.99 0.99 0.99 0.99 0.99
## linked liquid luxurious mansfield many
## 0.99 0.99 0.99 0.99 0.99
## may middleaged milieu missiles mistrust
## 0.99 0.99 0.99 0.99 0.99
## modified montreal moods motives mountains
## 0.99 0.99 0.99 0.99 0.99
## move musician naive names narrowed
## 0.99 0.99 0.99 0.99 0.99
## nash natalie nation natural nervous
## 0.99 0.99 0.99 0.99 0.99
## numbered obligations occupancy occurrence october
## 0.99 0.99 0.99 0.99 0.99
## outsourcing outweigh painted palate pastry
## 0.99 0.99 0.99 0.99 0.99
## pear performances permanently physically platter
## 0.99 0.99 0.99 0.99 0.99
## politics pope preached present productivity
## 0.99 0.99 0.99 0.99 0.99
## professors projector proves purchased push
## 0.99 0.99 0.99 0.99 0.99
## queen rack raisins reacted rectangular
## 0.99 0.99 0.99 0.99 0.99
## redevelopment reinforced relatively relies remark
## 0.99 0.99 0.99 0.99 0.99
## remarks rendered resembles resistance revered
## 0.99 0.99 0.99 0.99 0.99
## ripples root ropes rubbed sack
## 0.99 0.99 0.99 0.99 0.99
## sacrificed saison saudi search secure
## 0.99 0.99 0.99 0.99 0.99
## sequence setup shallow shelved shepherd
## 0.99 0.99 0.99 0.99 0.99
## sheridan shields shocking shrugged siblings
## 0.99 0.99 0.99 0.99 0.99
## soften softened split sponge stages
## 0.99 0.99 0.99 0.99 0.99
## sting stirred stored strengths strewn
## 0.99 0.99 0.99 0.99 0.99
## stripes stroller subjects subsequently suggests
## 0.99 0.99 0.99 0.99 0.99
## suited supermarket survivor tackling tactic
## 0.99 0.99 0.99 0.99 0.99
## tail tangy target targeting testify
## 0.99 0.99 0.99 0.99 0.99
## texture thighs towering traced tracks
## 0.99 0.99 0.99 0.99 0.99
## traditionally traits transport triumph turmoil
## 0.99 0.99 0.99 0.99 0.99
## twists umbrella unacceptable uncles uncomfortable
## 0.99 0.99 0.99 0.99 0.99
## uses viewpoint visually waiter warming
## 0.99 0.99 0.99 0.99 0.99
## warn waters welfare withdrawals within
## 0.99 0.99 0.99 0.99 0.99
## wreckage
## 0.99
Create a function extracting n-grams out of the Corpus and plot the details
It is interesting to see which words are related, as I did not expect these words to have a correlation of 0.99 with the word “will”; this gives me good insight into the Corpus.
The wordcloud gave me “some” information about the words that occurred most, but with the ggplot2 package a better visualization can be obtained. I first wrote a function that computes n-grams for each row of text data in R. After some experiments of my own I found the function on Stack Overflow: http://stackoverflow.com/questions/17556085/compute-ngrams-for-each-row-of-text-data-in-r
This code produces a bar chart of the twenty single words found most in the Corpus.
# Tokenizer that splits each document into single words (1-grams)
OneT <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
OneGram <- TermDocumentMatrix(sampleCorpus, control = list(tokenize = OneT))
OneGram <- as.matrix(rollup(OneGram, 2, na.rm = TRUE, FUN = sum))
OneGram <- data.frame(word = rownames(OneGram), freq = OneGram[, 1])
OneGram <- OneGram[order(-OneGram$freq), ][1:20, ]
OneGram$word <- factor(OneGram$word, as.character(OneGram$word))
p <- ggplot(OneGram, aes(x = word, y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "red") +
  ggtitle("Count of the single words that are found most") +
  ylab("Frequency found in Corpus (times)") +
  xlab("Word found most in Corpus") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
Let’s move on to the BiGram
The plot gave a good overview of the single words that were found most in the Corpus and of how their frequencies are distributed. I did the same thing for the twenty most frequent combinations of two consecutive words.
# Tokenizer that splits each document into pairs of consecutive words (2-grams)
BiT <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
BiGram <- TermDocumentMatrix(sampleCorpus, control = list(tokenize = BiT))
BiGram <- as.matrix(rollup(BiGram, 2, na.rm = TRUE, FUN = sum))
BiGram <- data.frame(word = rownames(BiGram), freq = BiGram[, 1])
BiGram <- BiGram[order(-BiGram$freq), ][1:20, ]
BiGram$word <- factor(BiGram$word, as.character(BiGram$word))
p <- ggplot(BiGram, aes(x = word, y = freq)) +
  geom_bar(stat = "identity", fill = "red", color = "steelblue") +
  ggtitle("Count of the two-word combinations that were found most") +
  ylab("Frequency found in Corpus (times)") +
  xlab("Words found most in Corpus") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
Let’s get closer to the prediction model with the TriGram
I did the same thing for the twenty most frequent combinations of three consecutive words.
# Tokenizer that splits each document into runs of three consecutive words (3-grams)
TriT <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
TriGram <- TermDocumentMatrix(sampleCorpus, control = list(tokenize = TriT))
TriGram <- as.matrix(rollup(TriGram, 2, na.rm = TRUE, FUN = sum))
TriGram <- data.frame(word = rownames(TriGram), freq = TriGram[, 1])
TriGram <- TriGram[order(-TriGram$freq), ][1:20, ]
TriGram$word <- factor(TriGram$word, as.character(TriGram$word))
p <- ggplot(TriGram, aes(x = word, y = freq)) +
  geom_bar(stat = "identity", fill = "yellow", color = "black") +
  ggtitle("Count of the three-word combinations that were found most") +
  ylab("Frequency found in Corpus (times)") +
  xlab("Words found most in Corpus") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
Finally, the four-word combinations found most in the Corpus
I did the same thing for the twenty most frequent combinations of four consecutive words.
# Tokenizer that splits each document into runs of four consecutive words (4-grams)
FourT <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
FrGram <- TermDocumentMatrix(sampleCorpus, control = list(tokenize = FourT))
FrGram <- as.matrix(rollup(FrGram, 2, na.rm = TRUE, FUN = sum))
FrGram <- data.frame(word = rownames(FrGram), freq = FrGram[, 1])
FrGram <- FrGram[order(-FrGram$freq), ][1:20, ]
FrGram$word <- factor(FrGram$word, as.character(FrGram$word))
p <- ggplot(FrGram, aes(x = word, y = freq)) +
  geom_bar(stat = "identity", fill = "yellow", color = "black") +
  ggtitle("Count of the four-word combinations that were found most") +
  ylab("Frequency found in Corpus (times)") +
  xlab("Words found most in Corpus") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
Conclusion
My objective for this project is to build a predictive text model that presents three options for what the next word might be. So far I have concluded that the files provided are huge and that the training set I used is small, so that my computer could handle it. This can cause some noise in the predicted words.
I need to clean the Corpus better, as I saw a lot of “dirt” in the plots, and I will also make the Corpus smaller to get better predictions.
I will also study the Markov chain, a stochastic process (in discrete time, a DTMC). The term “Markov chain” refers to the sequence of random words such a process moves through, with the Markov property defining the serial dependence between words (as in a “chain”). It can thus be used for predicting the next word through a chain of linked words.
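As a first, rough sketch of this idea, the code below counts the bigrams in the cleaned sample Corpus and uses them as a simple Markov-style lookup table: given the last typed word, it suggests the three most frequent words that followed it in the Corpus. The function name predictNext is my own, and the final model will be more refined than this.
# Count the bigrams in the cleaned sample Corpus (same tokenizer as above)
BiT <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
BiTDM <- TermDocumentMatrix(sampleCorpus, control = list(tokenize = BiT))
bifreq <- sort(rowSums(as.matrix(BiTDM)), decreasing = TRUE)
# Split every bigram into the current word and the word that follows it
parts <- strsplit(names(bifreq), " ")
bigrams <- data.frame(first = sapply(parts, `[`, 1),
                      second = sapply(parts, `[`, 2),
                      freq = bifreq,
                      stringsAsFactors = FALSE)
# Suggest the three most frequent followers of a typed word (Markov-chain style)
predictNext <- function(word, n = 3) {
  head(bigrams$second[bigrams$first == tolower(word)], n)
}
predictNext("new") # might return for example "york" "year" ...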