Capstone project - week 2 report

Getting and cleaning the data

The raw data for this project was downloaded to my working directory from the Capstone data set.

The data set contains zipped .txt files for 4 languages, with 3 sources per language: Blogs, News and Twitter. For this project I will work with the 3 English-language source files.

# File paths have been defined in a hidden block
dat_blog <- readLines(txt_blog, skipNul = TRUE)
dat_news <- readLines(txt_news, skipNul = TRUE)
dat_twit <- readLines(txt_twit, skipNul = TRUE)

Description of raw data

Number of lines and words in each document

	Number.of.lines	Number.of.words
Blogs	899,288	37,334,441
News	77,259	2,643,972
Twitter	2,360,148	30,373,832
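
Counts like the ones in the table can be computed along the following lines. This is a minimal sketch, not necessarily the code used for the report: `count_lines_words` is a hypothetical helper, words are approximated as whitespace-separated tokens, and the toy input stands in for dat_blog, dat_news and dat_twit.

```r
# Hypothetical helper: count lines and whitespace-separated words
# in a character vector (one element per line)
count_lines_words <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  # Empty lines yield no tokens; count only non-empty strings
  n_words <- sum(vapply(tokens, function(t) sum(nzchar(t)), integer(1)))
  c(Number.of.lines = length(lines), Number.of.words = n_words)
}

# Toy input (assumption) standing in for the real source files
toy <- list(
  Blogs = c("the quick brown fox", "jumps over"),
  News  = c("one line only")
)

summary_table <- t(sapply(toy, count_lines_words))
print(summary_table)
```

Tokenizing on white space is only one possible definition of a word, so other tools may report slightly different totals.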

Cleaning the data

Inspection of the data reveals that some cleaning needs to be carried out before the documents are brought together as a corpus for modeling and analysis. Specifically, non-ASCII characters need to be removed.

# Drop every comma-separated fragment that contains a non-ASCII character.
# iconv() replaces each unconvertible byte with the marker, which grepl() then detects.
remove_non_ascii <- function(lines, marker = "NON_ASCII_MARKER") {
  pieces <- unlist(strsplit(lines, split = ", "))
  flagged <- grepl(marker, iconv(pieces, "latin1", "ASCII", sub = marker), fixed = TRUE)
  # Logical indexing stays correct when nothing is flagged
  # (negative indexing with an empty grep() result would drop every element)
  paste(pieces[!flagged], collapse = ", ")
}
dat_blog5 <- remove_non_ascii(dat_blog)
dat_news5 <- remove_non_ascii(dat_news)
dat_twit5 <- remove_non_ascii(dat_twit)
# Creating a reference to the 3 vector files as input for corpus
vec_source <- c(dat_blog5, dat_news5, dat_twit5)

The rest of the cleaning will be carried out on the documents after they are incorporated into a corpus, using the "tm" package.

# Creating Corpus in tm
library(tm)
swift_corpus1 <- VCorpus(VectorSource(vec_source))
inspect(swift_corpus1)

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 153600944
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 14190366
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 161150682

rm(vec_source, dat_blog, dat_blog5, dat_news, dat_news5, dat_twit, dat_twit5)

Cleaning the corpus

The cleaning steps are as follows:
1. Remove hashtags and Twitter handles.
2. Remove URLs from all documents.
3. Remove extra white space.
4. Transform all text to lowercase.
5. Remove words which are used very frequently (stopwords). This step is just for the sake of analyzing the corpus. For the actual text prediction model I will be using a corpus which includes stopwords, because I would like to predict next words during use of natural language.
6. Remove curse words.
7. Remove numbers.
8. Remove punctuation. (This will also be returned to the corpus when building the model.)

# Creating custom functions that will help in cleaning the corpus
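# Match the whole URL (http or https) up to the next white space;
# the pattern "http:[[:alnum:]]*" would stop at the first "/" and leave the rest of the URL behind
removeURL <- content_transformer(function(x) gsub("https?://\\S+", "", x))
removeHashTags <- content_transformer(function(x) gsub("#\\S+", "", x))
removeTwitterHandles <- content_transformer(function(x) gsub("@\\S+", "", x))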
# Defining profanity dictionary
badwords <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
# Words that stay in the corpus even though the downloaded list flags them.
# Note: %in% compares whole strings exactly, so the entries that start with "^" are not treated as regular expressions.
keep_words <- c("refugee","reject","remains","screw","welfare", "sweetness","shoot","sick","shooting","servant","sex","radical","racial","racist","republican","public","molestation","mexican","looser","lesbian","liberal","kill","killing","killer","heroin","fraud","fire","fight","fairy","^die","death","desire","deposit","crash","^crim","crack","^color","cigarette","church","^christ","canadian", "cancer","^catholic","cemetery","buried","burn","breast","^bomb","^beast","attack","australian","balls","baptist","^addict","abuse","abortion","amateur","asian","aroused","angry","arab","bible")
badwords <- badwords[!(badwords %in% keep_words)]

swift_corpus1 <- tm_map(swift_corpus1, removeHashTags, lazy = FALSE)
swift_corpus1 <- tm_map(swift_corpus1, removeTwitterHandles, lazy = FALSE)
swift_corpus1 <- tm_map(swift_corpus1, removeURL, lazy = FALSE)
swift_corpus1 <- tm_map(swift_corpus1, stripWhitespace)
swift_corpus1 <- tm_map(swift_corpus1, content_transformer(tolower))
swift_corpus1 <- tm_map(swift_corpus1, removeWords, stopwords("english"))
swift_corpus1 <- tm_map(swift_corpus1, removeWords, badwords)
swift_corpus1 <- tm_map(swift_corpus1, removeNumbers)
swift_corpus1 <- tm_map(swift_corpus1, removePunctuation)
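
To make the effect of the cleaning steps concrete, here is a small base R sketch that applies the same kind of patterns to a single toy string. It is an illustration only: the input string is invented, the URL pattern `https?://\\S+` is an assumption, and stop-word and profanity removal are left out because they depend on tm and on the word lists.

```r
# Toy tweet-like string (assumption, not taken from the data set)
x <- "Loving #rstats with @someone at http://example.com/page   in 2016!"

x <- gsub("#\\S+", "", x)            # hashtags
x <- gsub("@\\S+", "", x)            # Twitter handles
x <- gsub("https?://\\S+", "", x)    # URLs (assumed pattern)
x <- gsub("[[:space:]]+", " ", x)    # collapse runs of white space
x <- tolower(x)                      # lowercase
x <- gsub("[[:digit:]]+", "", x)     # numbers
x <- gsub("[[:punct:]]+", "", x)     # punctuation
# Removals leave gaps behind, so collapse and trim white space once more
x <- trimws(gsub("[[:space:]]+", " ", x))
print(x)
```

Removing hashtags and handles before punctuation matters: once the "#" and "@" characters are stripped, the tags can no longer be told apart from ordinary words.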

Now that the corpus is clean, we can structure it in a way that enables analysis. This is done by creating a Term/Document Matrix, which tells us how many times each word appears in each of the documents in the corpus.

# Creating a Term/Document Matrix
swift_tmd1 <- TermDocumentMatrix(swift_corpus1)
inspect(swift_tmd1)

## <<TermDocumentMatrix (terms: 558446, documents: 3)>>
## Non-/sparse entries: 729815/945523
## Sparsity           : 56%
## Maximal term length: 246
## Weighting          : term frequency (tf)
## Sample             :
##       Docs
## Terms      1    2      3
##   can  73348 3892  86312
##   day  39316 2022  87387
##   get  51032 2908 108691
##   good 35675 1997  96953
##   just 72164 3492 145352
##   like 70606 3283 117715
##   love 34724  678 100491
##   one  92580 5685  78796
##   time 66719 3601  73058
##   will 85381 7677  91472

In order to further inspect the corpus, we will create a data frame which shows word frequencies for all the words across documents.

myTdm <- as.matrix(swift_tmd1)
FreqDF <- data.frame(ST = rownames(myTdm), 
          Freq = rowSums(myTdm), 
          row.names = NULL)

Additional corpus analysis

Checking the structure of the corpus in terms of single word frequencies.

When we check how many words in the entire corpus appear fewer than 10 times:

The charts underline the following information about the corpus:

There is a large group of terms which appear just once in the entire corpus.
The number of terms per frequency drops rapidly, from thousands of terms which appear 2 times to under 50 terms which appear 200 times.
The long tail consists of around 1000 single terms which appear over 5000 times each.
The word which appears most often in the corpus (not including stop-words) is "just". It appears 221,008 times.
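
Figures of this kind can be derived from a frequency data frame with the same layout as FreqDF. The sketch below uses an invented toy table (`toy_freq` and its values are assumptions) and is not necessarily the code behind the charts.

```r
# Toy stand-in for FreqDF: one row per term with its total frequency
toy_freq <- data.frame(
  ST   = c("just", "like", "time", "sword", "toast", "eggs"),
  Freq = c(12, 9, 9, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Number of distinct terms for each frequency value
terms_per_freq <- table(toy_freq$Freq)
print(terms_per_freq)

# Number of terms that appear fewer than 10 times
n_rare <- sum(toy_freq$Freq < 10)

# Most frequent term in the table
top_term <- toy_freq$ST[which.max(toy_freq$Freq)]
print(c(rare_terms = n_rare, top = top_term))
```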

Looking at ngrams

For the prediction algorithm I will be using ngrams, which are sequences of n consecutive terms in the corpus. Thus, a 2-gram consists of 2 consecutive single terms and a 3-gram consists of 3 consecutive single terms. The following demonstrates the creation and analysis of 2-gram and 3-gram Term/Document matrices. Some of the phrases in the examples below may not make sense because stop-words have been removed. The corpus for the final model will include stop-words. The following example is from a sample of the blog text source.

Example of top bigrams from the sample:

##                   ST Freq
## 21543  mister rogers   56
## 20226      make sure   45
## 19147     little boy   41
## 23685        one day   41
## 22738       new york   40
## 38904      years ago   38
## 3256       big sword   36
## 18069     last night   31
## 18097      last year   31
## 29536 scrambled eggs   30

Example of top trigrams from the sample:

##                         ST Freq
## 4401         boy big sword   24
## 21421       little boy big   24
## 25506        new york city   18
## 17900  important places us   14
## 9098  defenders faith like   12
## 12222      faith like ones   12
## 18900   jews nothing worry   12
## 21093       like ones sure   12
## 22165       love toast mom   12
## 25919  nothing worry likes   12
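
For readers who want to see how such n-gram counts arise, the following base R sketch builds bigrams and trigrams from a vector of tokens and tabulates them. It is an assumption-laden illustration rather than the code that produced the tables above: `make_ngrams` is a hypothetical helper and the token sequence is invented.

```r
# Hypothetical helper: all n-grams (as space-separated strings) from a token vector
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy token sequence (assumption), already cleaned and without stop-words
tokens <- strsplit("little boy big sword little boy big sword new york city", " ")[[1]]

bigram_freq  <- sort(table(make_ngrams(tokens, 2)), decreasing = TRUE)
trigram_freq <- sort(table(make_ngrams(tokens, 3)), decreasing = TRUE)

# Same layout as the tables above: term (ST) and frequency (Freq)
bigram_df <- data.frame(ST = names(bigram_freq),
                        Freq = as.vector(bigram_freq),
                        stringsAsFactors = FALSE)
print(head(bigram_df, 3))
```

On the full corpus a dedicated tokenizer would be more practical, since this helper holds every n-gram in memory at once.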

Thoughts about the prediction algorithm

The general idea for the proposed algorithm is to follow the text being typed in by the user and use it to constantly populate 3 slots:

The Unigram slot - will be populated by the last word typed.
The Bigram slot - will be populated by the last 2 consecutive terms typed.
The Trigram slot - will be populated by the last 3 consecutive terms typed.

This is an illustration of the slots:

[Figure: ngram slots]

As a term is typed, the algorithm will search 3 Term/Document matrices (Bigrams, Trigrams and Quadgrams) to find the 3 most probable terms to follow the term in each slot.

From the (up to) 9 words found, the algorithm will suggest to the user the word that comes up the most. If there's a tie, it will suggest the word with the highest probability.
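
The idea can be sketched in base R as follows. This is a rough illustration under several assumptions, not the final implementation: `candidates_for_slot` and `predict_next` are hypothetical helpers, the n-gram tables are tiny invented data frames with the ST/Freq layout used above, and "probability" is taken to mean the relative frequency of an n-gram among all n-grams that share the slot as a prefix.

```r
# Hypothetical helper: top-k next words for one slot from one n-gram table
candidates_for_slot <- function(slot, ngram_df, k = 3) {
  st <- as.character(ngram_df$ST)
  keep <- startsWith(st, paste0(slot, " "))
  if (!any(keep)) return(data.frame(word = character(0), prob = numeric(0)))
  hits <- data.frame(word = sub(".* ", "", st[keep]),            # last word of the n-gram
                     prob = ngram_df$Freq[keep] / sum(ngram_df$Freq[keep]),
                     stringsAsFactors = FALSE)
  hits <- hits[order(-hits$prob), , drop = FALSE]
  hits[seq_len(min(k, nrow(hits))), , drop = FALSE]
}

# Hypothetical helper: fill the 3 slots from the typed text and vote
predict_next <- function(text, bigrams, trigrams, quadgrams) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  tables <- list(bigrams, trigrams, quadgrams)
  found <- list()
  for (n in 1:3) {
    if (length(words) < n) next
    slot <- paste(tail(words, n), collapse = " ")   # last n words typed
    found[[length(found) + 1]] <- candidates_for_slot(slot, tables[[n]])
  }
  found <- do.call(rbind, found)
  if (is.null(found) || nrow(found) == 0) return(NA_character_)
  votes <- table(found$word)
  best_prob <- tapply(found$prob, found$word, max)
  # Most votes first; ties are broken by the highest probability
  ranked <- order(-as.vector(votes[names(best_prob)]), -as.vector(best_prob))
  names(best_prob)[ranked[1]]
}

# Toy n-gram tables (assumptions)
bigrams   <- data.frame(ST = c("york city", "york times", "york state"),
                        Freq = c(5, 3, 1), stringsAsFactors = FALSE)
trigrams  <- data.frame(ST = c("new york city", "new york times"),
                        Freq = c(4, 2), stringsAsFactors = FALSE)
quadgrams <- data.frame(ST = "in new york times", Freq = 1, stringsAsFactors = FALSE)

suggestion <- predict_next("I live in New York", bigrams, trigrams, quadgrams)
print(suggestion)
```

In this toy example "times" is found through all three slots while "city" is found through only two, so the vote favors "times" even though "city" is the more frequent bigram continuation. With only one word typed, every candidate gets a single vote and the tie-break on probability decides.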