This report summarizes the second week of work on the Capstone project in the Data Science specialization on Coursera.
The goal of the report is to demonstrate downloading the project data, structuring the data for analysis, cleaning and preparing the data, some summary statistics about the data, and my initial plans for the proposed text prediction algorithm and app.
The raw data for this project was downloaded to my working directory from The Capstone data set.
This data set includes zipped .txt files, for 4 different languages and 3 data sources for each language: Blogs, News, Twitter. For the purpose of this project I will be working with the 3 English language source files.
# File paths have been defined in a hidden block
dat_blog <- readLines(txt_blog, skipNul = TRUE)
dat_news <- readLines(txt_news, skipNul = TRUE)
dat_twit <- readLines(txt_twit, skipNul = TRUE)
Number of lines and words in each document
| | Number.of.lines | Number.of.words |
|---|---|---|
| Blogs | 899,288 | 37,334,441 |
| News | 77,259 | 2,643,972 |
| Twitter | 2,360,148 | 30,373,832 |
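The code that produced these counts is not shown in the report. As a minimal sketch of how such counts might be obtained in base R, the example below uses a small made-up vector (`toy_lines`, a hypothetical stand-in for `dat_blog`, `dat_news` or `dat_twit`) and treats a word as any run of non-whitespace characters, which is an assumption rather than the report's definition.

```r
# Toy stand-in for one of the character vectors returned by readLines()
toy_lines <- c("The quick brown fox", "jumps over", "the lazy dog")

# Each element of the vector is one line of the file
n_lines <- length(toy_lines)

# Split every line on runs of whitespace and count the resulting pieces
n_words <- sum(lengths(strsplit(trimws(toy_lines), "\\s+")))

data.frame(Number.of.lines = n_lines, Number.of.words = n_words)
```

A dedicated tokenizer may count words differently (for example around punctuation), so figures from this approach are approximate.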
Inspection of the data reveals that some cleaning actions need to be carried out before the documents are brought together as a corpus for modeling and analysis, specifically the removal of non-ASCII characters. In the code below, each document is split on ", " and every fragment that contains a non-ASCII character is dropped.
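# Split each document into fragments on ", "
dat_blog2 <- unlist(strsplit(dat_blog, split=", "))
# iconv() writes the marker "dat_blog2" in place of every byte it cannot convert to ASCII, so grepl() flags the fragments that contain non-ASCII characters
dat_blog3 <- grepl("dat_blog2", iconv(dat_blog2, "latin1", "ASCII", sub="dat_blog2"))
# Logical indexing stays correct even when nothing is flagged (negative indexing with an empty index would drop every fragment)
dat_blog4 <- dat_blog2[!dat_blog3]
dat_blog5 <- paste(dat_blog4, collapse = ", ")
rm(dat_blog2, dat_blog3, dat_blog4)
# Same procedure for the news source
dat_news2 <- unlist(strsplit(dat_news, split=", "))
dat_news3 <- grepl("dat_news2", iconv(dat_news2, "latin1", "ASCII", sub="dat_news2"))
dat_news4 <- dat_news2[!dat_news3]
dat_news5 <- paste(dat_news4, collapse = ", ")
rm(dat_news2, dat_news3, dat_news4)
# Same procedure for the Twitter source
dat_twit2 <- unlist(strsplit(dat_twit, split=", "))
dat_twit3 <- grepl("dat_twit2", iconv(dat_twit2, "latin1", "ASCII", sub="dat_twit2"))
dat_twit4 <- dat_twit2[!dat_twit3]
dat_twit5 <- paste(dat_twit4, collapse = ", ")
rm(dat_twit2, dat_twit3, dat_twit4)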
# Creating a reference to the 3 vector files as input for corpus
vec_source <- c(dat_blog5, dat_news5, dat_twit5)
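The non-ASCII filter above relies on the `sub` argument of `iconv()`, which may be easier to follow on a tiny example. The sketch below is illustrative only: the vector `toy` and the marker string `"NONASCII"` are made up for the demonstration, and the input is assumed to be UTF-8 rather than latin1.

```r
# Three toy fragments; the second contains a non-ASCII character (e with acute accent)
toy <- c("plain ascii text", "caf\u00e9 au lait", "another clean piece")

# Hypothetical marker that iconv() writes in place of every unconvertible byte
marker <- "NONASCII"
converted <- iconv(toy, from = "UTF-8", to = "ASCII", sub = marker)

# Fragments whose converted form contains the marker held non-ASCII characters
has_non_ascii <- grepl(marker, converted, fixed = TRUE)

# Keep only the clean fragments
toy_clean <- toy[!has_non_ascii]
toy_clean
```

Note that this drops the whole fragment rather than just the offending character, which is also what the report's code does.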
The rest of the cleaning will be carried out on the documents after they are incorporated into a corpus, using the “tm” package.
# Creating Corpus in tm
swift_corpus1 <- VCorpus(VectorSource(vec_source))
inspect(swift_corpus1)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 153600944
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 14190366
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 161150682
rm(vec_source, dat_blog, dat_blog5, dat_news, dat_news5, dat_twit, dat_twit5)
Following are the cleaning steps:
1. Remove hashtags and Twitter handles.
2. Remove URLs from all documents.
3. Remove extra white space.
4. Transform all text to lowercase.
5. Remove words which are used very frequently (stopwords). (This step is just for the sake of analyzing the corpus. For the actual text prediction model I will be using a corpus which includes stopwords, because I would like to predict next words during use of natural language.)
6. Remove curse words.
7. Remove numbers.
8. Remove punctuation. (This will also be returned to the corpus when building the model.)
# Creating custom functions that will help in cleaning the corpus
# Match the whole address, not just the "http:" prefix, and cover https as well
removeURL <- content_transformer(function(x) gsub("https?://\\S+", "", x))
removeHashTags <- content_transformer(function(x) gsub("#\\S+", "", x))
removeTwitterHandles <- content_transformer(function(x) gsub("@\\S+", "", x))
# Defining profanity dictionary
badwords <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
# Keep these ordinary words out of the profanity list. %in% compares whole strings exactly, so the entries starting with "^" are not treated as regular expressions.
badwords <- badwords[!(badwords %in% c("refugee","reject","remains","screw","welfare", "sweetness","shoot","sick","shooting","servant","sex","radical","racial","racist","republican","public","molestation","mexican","looser","lesbian","liberal","kill","killing","killer","heroin","fraud","fire","fight","fairy","^die","death","desire","deposit","crash","^crim","crack","^color","cigarette","church","^christ","canadian", "cancer","^catholic","cemetery","buried","burn","breast","^bomb","^beast","attack","australian","balls","baptist","^addict","abuse","abortion","amateur","asian","aroused","angry","arab","bible"))]
swift_corpus1 <- tm_map(swift_corpus1, removeHashTags, lazy = FALSE)
swift_corpus1 <- tm_map(swift_corpus1, removeTwitterHandles, lazy = FALSE)
swift_corpus1 <- tm_map(swift_corpus1, removeURL, lazy = FALSE)
swift_corpus1 <- tm_map(swift_corpus1, stripWhitespace)
swift_corpus1 <- tm_map(swift_corpus1, content_transformer(tolower))
swift_corpus1 <- tm_map(swift_corpus1, removeWords, stopwords("english"))
swift_corpus1 <- tm_map(swift_corpus1, removeWords, badwords)
swift_corpus1 <- tm_map(swift_corpus1, removeNumbers)
swift_corpus1 <- tm_map(swift_corpus1, removePunctuation)
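To make the effect of the first cleaning steps concrete, here is a small base R sketch that applies the same kind of patterns to a single made-up string, without the tm package. The string `toy_tweet` is invented, and the URL pattern `https?://\\S+` is an assumption of this sketch (it removes the whole address up to the next space).

```r
# A made-up tweet containing a hashtag, a handle and a URL
toy_tweet <- "Loving #rstats with @someone see http://example.com/page now"

step1 <- gsub("#\\S+", "", toy_tweet)        # drop hashtags
step2 <- gsub("@\\S+", "", step1)            # drop Twitter handles
step3 <- gsub("https?://\\S+", "", step2)    # drop URLs (assumed pattern)
step4 <- gsub("\\s+", " ", step3)            # collapse runs of whitespace
step5 <- tolower(step4)                      # lowercase everything

step5
```

The order matters: hashtags, handles and URLs are removed before punctuation, because stripping punctuation first would delete the `#`, `@` and `://` characters these patterns rely on.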
Now that the corpus is clean, we can structure it in a way that enables analysis. This is done by creating a Term/Document Matrix, which tells us how many times each word appears in each of the documents in the corpus.
# Creating a Term/Document Matrix
swift_tmd1 <- TermDocumentMatrix(swift_corpus1)
inspect(swift_tmd1)
## <<TermDocumentMatrix (terms: 558446, documents: 3)>>
## Non-/sparse entries: 729815/945523
## Sparsity : 56%
## Maximal term length: 246
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 1 2 3
## can 73348 3892 86312
## day 39316 2022 87387
## get 51032 2908 108691
## good 35675 1997 96953
## just 72164 3492 145352
## like 70606 3283 117715
## love 34724 678 100491
## one 92580 5685 78796
## time 66719 3601 73058
## will 85381 7677 91472
In order to further inspect the corpus, we will create a data frame which shows word frequencies for all the words across documents.
myTdm <- as.matrix(swift_tmd1)
FreqDF <- data.frame(ST = rownames(myTdm),
Freq = rowSums(myTdm),
row.names = NULL)
Checking the structure of the corpus in terms of single word frequencies.
When we check how many words in the entire corpus appear less than 10 times:
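The code for this check is not shown in the report. One way it might be done, sketched here on a made-up frequency table (`FreqDF_toy` is a hypothetical stand-in with the same `ST` and `Freq` columns as `FreqDF`), is to count the rows whose frequency falls below the threshold:

```r
# Toy frequency table with the same columns as FreqDF
FreqDF_toy <- data.frame(ST = c("time", "love", "zzyzx", "qwertyish"),
                         Freq = c(120, 45, 3, 1),
                         stringsAsFactors = FALSE)

# TRUE for every term that appears fewer than 10 times in the corpus
rare <- FreqDF_toy$Freq < 10

n_rare <- sum(rare)       # number of rare terms
share_rare <- mean(rare)  # proportion of the vocabulary that is rare

c(n_rare = n_rare, share_rare = share_rare)
```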
The charts underline the following information about the corpus:
For the prediction algorithm I will be using n-grams, which are sequences of n consecutive terms that occur in the corpus. Thus, a 2-gram consists of 2 consecutive single terms, and a 3-gram consists of 3 consecutive single terms. The following demonstrates the creation and analysis of 2-gram and 3-gram Document/Term matrices. Some of the phrases in the examples below may not make sense because of the removal of stop-words. The corpus for the final model will include stop-words. The following example is from a sample of the blog text source.
Example of the top bigrams from the sample:
## ST Freq
## 21543 mister rogers 56
## 20226 make sure 45
## 19147 little boy 41
## 23685 one day 41
## 22738 new york 40
## 38904 years ago 38
## 3256 big sword 36
## 18069 last night 31
## 18097 last year 31
## 29536 scrambled eggs 30
Example of the top trigrams from the sample:
## ST Freq
## 4401 boy big sword 24
## 21421 little boy big 24
## 25506 new york city 18
## 17900 important places us 14
## 9098 defenders faith like 12
## 12222 faith like ones 12
## 18900 jews nothing worry 12
## 21093 like ones sure 12
## 22165 love toast mom 12
## 25919 nothing worry likes 12
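The tokenizer used to build the bigram and trigram tables above is not shown in the report. The following base R sketch illustrates the general idea on a short made-up string. The helper `make_ngrams` and the variable names are hypothetical, and splitting on single spaces assumes the text has already been cleaned as described earlier.

```r
# Build all n-grams (as space-separated strings) from a vector of tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy cleaned text (stop-words already removed)
toy_text <- "little boy big sword little boy big sword new york city"
tokens <- strsplit(toy_text, " ", fixed = TRUE)[[1]]

# Count each distinct bigram and arrange the result like the tables above
bigram_counts <- sort(table(make_ngrams(tokens, 2)), decreasing = TRUE)
BigramDF <- data.frame(ST = names(bigram_counts),
                       Freq = as.integer(bigram_counts),
                       stringsAsFactors = FALSE)
head(BigramDF, 3)

# Trigrams work the same way with n = 3
trigrams <- make_ngrams(tokens, 3)
```

On the full corpus a dedicated tokenizer would likely be faster, but the counting logic is the same.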
The general idea for the proposed algorithm is to follow the text being typed in by the user and use it to constantly populate 3 slots:
This is an illustration of the slots:
[Figure: ngram slots]
As a term is typed, the algorithm will search 3 Term/Document matrices (bigrams, trigrams and quadgrams) to find the 3 most probable terms to follow the term in each slot.
From the (up to) 9 words found, the algorithm suggests to the user the word that comes up the most. If there’s a tie, it suggests the word with the highest probability.
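The voting step described above can be sketched in base R as follows. This is only an illustration under several assumptions: the three slots are taken to hold the last one, two and three typed words, the tables `ngram_tables` and their columns (`prefix`, `nextword`, `prob`) are made-up toy data, and a tie is broken by the largest probability a word receives in any table. The function names `top_candidates` and `suggest_word` are hypothetical.

```r
# Toy lookup tables: 'prefix' is the slot content, 'nextword' a term that
# followed it in the corpus, 'prob' the estimated probability of that term
ngram_tables <- list(
  bigram = data.frame(prefix = "you",
                      nextword = c("are", "can", "for", "know"),
                      prob = c(0.30, 0.25, 0.15, 0.10),
                      stringsAsFactors = FALSE),
  trigram = data.frame(prefix = "thank you",
                       nextword = c("for", "so", "very"),
                       prob = c(0.50, 0.20, 0.10),
                       stringsAsFactors = FALSE),
  quadgram = data.frame(prefix = "to thank you",
                        nextword = c("for", "all", "so"),
                        prob = c(0.60, 0.25, 0.10),
                        stringsAsFactors = FALSE)
)

# Return the k most probable continuations of one slot from one table
top_candidates <- function(tbl, slot, k = 3) {
  hits <- tbl[tbl$prefix == slot, , drop = FALSE]
  hits <- hits[order(hits$prob, decreasing = TRUE), , drop = FALSE]
  head(hits, k)
}

# Pool the (up to) 9 candidates and pick the one found most often;
# ties are broken by the highest probability (an assumption of this sketch)
suggest_word <- function(slots, tables) {
  found <- do.call(rbind, unname(Map(top_candidates, tables, slots)))
  if (is.null(found) || nrow(found) == 0) return(NA_character_)
  votes <- table(found$nextword)
  best_prob <- tapply(found$prob, found$nextword, max)
  words <- names(votes)
  ranking <- order(-as.integer(votes), -as.numeric(best_prob[words]))
  words[ranking[1]]
}

# Slots after typing "... want to thank you"
suggest_word(c("you", "thank you", "to thank you"), ngram_tables)
```

In this toy data "for" appears among the top candidates of all three slots, so it wins the vote. When only the bigram slot finds matches, every candidate has one vote and the most probable one is returned.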