Capstone Project - Progress Report

B. McCracken - Tech Entrepreneur

Capstone Progress Report

This is the progress report for the Johns Hopkins University Data Science Specialization Capstone course. The project demonstrates the use of Natural Language Processing tools to build a model that predicts the next word typed in a sentence. The project uses several language processing packages:

  • tm: used to read the corpus of documents in a folder and create a VCorpus
  • quanteda: used in this exercise to show summary statistics of the corpus
  • RWeka: used to create a tokenizer and n-grams from a TermDocumentMatrix
  • dplyr and gridExtra: used to identify and plot the most used terms and n-grams

The tm package was the primary package used in this project.

Prepare Environment and Download Data

The first step in the project is to prepare the environment. There are three steps: 1. Load the needed packages 2. Download the data if needed 3. Create a sample directory to hold the sample corpus

packages<-function(x){    #function to determine if needed packages are installed, installing and loading them if not
  x<-as.character(match.call()[[2]])
  if (!require(x,character.only=TRUE)){
    install.packages(pkgs=x,repos="http://cran.r-project.org")
    require(x,character.only=TRUE)
  }
}

packages(dplyr); packages(downloader); packages(NLP); packages(openNLP); packages(RWeka)
packages(stringr); packages(stringi); packages(tm); packages(quanteda); packages(ggplot2)
packages(gridExtra); remove(packages)

#Step 2. Download the data for the project.  Test to see if it already exists first since this is a 510MB file
if (!file.exists("dataset.zip")) {
    fileurl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    download(fileurl, dest="dataset.zip", mode="wb") 
    unzip("dataset.zip", exdir = ".")  # extracts the files into a "final" directory that contains the data
}
#Step 3. Test to see if the sample directory already exists if the code is run more than once 

if (!dir.exists("./sample")) {
    dir.create("./sample")
}    
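
A quick check that the unzipped files landed where the later code expects them; this line is a sketch and not part of the original script, but the paths match the readLines calls below:

list.files("./final/en_US")  # should include en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt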

The data set from SwiftKey allows for analysis of multiple languages: English, German, Russian, and Finnish. This project will focus on the English version. The entire corpus of three documents is 510MB, which is too large to manipulate in memory. A sample of each document is selected and then written out to a sample directory.
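
The samples below are drawn with sample(), which is random; fixing the seed first would make the sampled files reproducible. A minimal sketch (the seed value is arbitrary and not part of the original script):

set.seed(1234)  # arbitrary value; a fixed seed makes the 0.1% samples repeatable across runs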

#sample the twitter file
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE) 
twitter_s <- sample(twitter, length(twitter) * 0.001); remove(twitter)

#sample the blogs file
blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding="UTF-8", skipNul=TRUE) 
blogs_s <- sample(blogs, length(blogs) * 0.001); remove(blogs)

#sample the news file
news <- readLines("./final/en_US/en_US.news.txt", encoding="UTF-8", skipNul=TRUE) 
news_s <- sample(news, length(news) * 0.001); remove(news)

#write the files to a folder
writeLines(blogs_s, con = "./sample/en_US.blogs.txt", sep = "\n", useBytes = FALSE)
writeLines(twitter_s, con = "./sample/en_US.twitter.txt", sep = "\n", useBytes = FALSE)
writeLines(news_s, con = "./sample/en_US.news.txt", sep = "\n", useBytes = FALSE)
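
Since stringi is already loaded, a quick sanity check on the samples can be run at this point; this is a sketch and was not part of the original analysis:

# line, character and whitespace counts for each sample
lapply(list(blogs = blogs_s, news = news_s, twitter = twitter_s), stri_stats_general)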

Create and Analyze Corpus

Analyze Corpus: The corpus has three files: a US news file, blog extracts, and a file of tweets. A sample of all three files was chosen for analysis to build a broader range of phrases and terms upon which to build an application dictionary. In cleaning the data, stemming and removing stop words produced somewhat meaningless words and phrases (a reference sketch of those steps appears after the cleaning code below).

#================================
# The code uses tm to create a corpus of the documents in a directory and then uses quanteda's corpus() to convert it
#================================
cname <- file.path("./sample/"); docs <- Corpus(DirSource(cname)); myCorpus <- corpus(docs) 

#=======Create document feature matrix
myDfm <- dfm(myCorpus, ignoredFeatures = stopwords("english"), stem = FALSE)  # lowercases, removes punctuation, and drops English stop words
top10 <- topfeatures(myDfm, 10); top10df <- data.frame(ngram=names(top10), occurrences=top10); top10df <- arrange(top10df, desc(occurrences)) 

#======Cleaning the tm VCorpus before creating ngrams 
toEmpty <- content_transformer(function(x, pattern) gsub(pattern, "", x))  # helper that deletes a regex pattern
docs <- tm_map(docs, toEmpty, "#\\w+"); docs <- tm_map(docs, removePunctuation); docs <- tm_map(docs, content_transformer(tolower))  # drop hashtags and punctuation, lowercase
docs <- tm_map(docs, removeNumbers); docs <- tm_map(docs, stripWhitespace)
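
For reference, the stemming and stop-word removal steps mentioned above, which produced hard-to-read terms, would look roughly like this in tm (a sketch only; it assumes the SnowballC package is installed and it is not applied to docs below):

# not applied below - shown only to document the steps that were tried and dropped
docs_experiment <- tm_map(docs, removeWords, stopwords("english"))  # remove common English stop words
docs_experiment <- tm_map(docs_experiment, stemDocument)            # Porter stemming (requires SnowballC)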

summary(myCorpus); topfeatures(myDfm, 10)
Corpus consisting of 3 documents.

  Text Types Tokens Sentences author       datetimestamp description
 text1  8162  42449      2029   <NA> 2016-08-29 02:28:37        <NA>
 text2  1305   2827       120   <NA> 2016-08-29 02:28:37        <NA>
 text3  7801  37650      2555   <NA> 2016-08-29 02:28:37        <NA>
 heading                id language origin
    <NA>   en_US.blogs.txt       en   <NA>
    <NA>    en_US.news.txt       en   <NA>
    <NA> en_US.twitter.txt       en   <NA>

Source:  Converted from tm VCorpus 'docs'
Created: Sun Aug 28 19:28:37 2016
Notes:   
 just   one     u  will  like   can   get  time   day  good 
  248   229   200   197   192   189   176   176   170   162 

Create and Plot Most Common Words and Phrases (NGRAMS)

Create and Plot NGRAMS: RWeka was used to create a tokenizer so that 2-, 3-, and 4-word n-grams could be extracted. ggplot was used to plot the top 10 items of the 1- to 4-word n-grams. The three- and four-word n-grams start to look more like news stories than tweets, unlike the one- and two-word n-grams.

#=================NGRAM_Tokenizer====================================
ngram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n_gram, max = n_gram))  # n_gram is set before each call below
#===========2-word ngrams =============
n_gram <- 2; bigramtdm <- TermDocumentMatrix(docs, control = list(tokenize = ngram_tokenizer))
bigramtdm.matrix <- as.matrix(bigramtdm); bitopwords <- rowSums(bigramtdm.matrix)
bitop10 <- head(sort(bitopwords, decreasing = TRUE),10); bigramdf <- data.frame(ngram=names(bitop10), occurrences=bitop10)
bigramdf <- arrange(bigramdf, desc(occurrences)) 
#===========3-word ngrams =============
n_gram <- 3; trigramtdm <- TermDocumentMatrix(docs, control = list(tokenize = ngram_tokenizer)) 
trigramtdm.matrix <- as.matrix(trigramtdm); tritopwords <- rowSums(trigramtdm.matrix)
tritop10 <-head(sort(tritopwords, decreasing = TRUE),10); trigramdf <- data.frame(ngram=names(tritop10), occurrences=tritop10)
trigramdf <- arrange(trigramdf, desc(occurrences)) 
#===========4-word ngrams =============
n_gram <- 4; quadgramtdm <- TermDocumentMatrix(docs, control = list(tokenize = ngram_tokenizer))
quadgramtdm.matrix <- as.matrix(quadgramtdm); quadtopwords <- rowSums(quadgramtdm.matrix)
quadtop10 <- head(sort(quadtopwords, decreasing = TRUE),10); quadgramdf <- data.frame(ngram=names(quadtop10), occurrences=quadtop10)
quadgramdf <- arrange(quadgramdf, desc(occurrences)) 

#================================Plotting Ngrams - setting up for gridExtra
g1<- ggplot(top10df, aes(x=reorder(ngram, -occurrences), y=occurrences)) 
graph1a<- g1 + geom_bar(fill="grey", color="black", stat="identity") + theme_bw() + ylab("Occurrences") + xlab("Top Words In Corpus")
graph1b<- graph1a + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g2<- ggplot(bigramdf, aes(x=reorder(ngram, -occurrences), y=occurrences)) 
graph2a<- g2 + geom_bar(fill="steelblue1", color="blue", stat="identity") + theme_bw() + ylab("Occurrences") + xlab("2 Word N-Grams")
graph2b<- graph2a + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g3<- ggplot(trigramdf, aes(x=reorder(ngram, -occurrences), y=occurrences)) 
graph3a<- g3 + geom_bar(fill="gold2", color="darkgoldenrod4", stat="identity") + theme_bw() + ylab("Occurrences") + xlab("3 Word N-Grams")
graph3b<- graph3a + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g4<- ggplot(quadgramdf, aes(x=reorder(ngram, -occurrences), y=occurrences)) 
graph4a<- g4 + geom_bar(fill="firebrick3", color="darkred", stat="identity") + theme_bw() + ylab("Occurrences") + xlab("4 Word N-Grams")
graph4b<- graph4a + theme(axis.text.x = element_text(angle = 90, hjust = 1))
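
The four panels built above are meant to be combined with gridExtra; the arranging call is not shown in the report, but it would look something like this (the 2x2 layout is an assumption):

grid.arrange(graph1b, graph2b, graph3b, graph4b, ncol = 2)  # place the four n-gram plots on one page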

ggplot(bigramdf, aes(x=reorder(ngram, -occurrences), y=occurrences)) +
      geom_bar(fill="steelblue1", color="blue", stat="identity") + theme_bw() + ylab("Occurences") + xlab("2 Word N-Grams") +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot of chunk Plot NGRAMS (top 10 two-word n-grams)

Conclusion and Thoughts on Application Design

Overall, memory limitations in R prevent using the entire corpus for analysis. As a result, I have tried to be very efficient in the use of data to achieve the objective. It has been very useful to test the capabilities of the tm, quanteda, and RWeka packages. It was cool to plot a “word cloud.” I had great difficulty with gridExtra as my graphs only took up ¼ of the page. This is why they are so small. Sorry about that.

The basic methodology for the n-gram text prediction model I will create is as follows:

Read a corpus and determine the optimal sample size to create a “dictionary” of terms and phrases to drive the app. The development will involve generating one-, two-, three-, and four-word n-grams. A four-word n-gram will be used if three words are entered and we are predicting the fourth, a three-word n-gram if two words are entered and we are predicting the third, and so on. The final decision will be determining how and when to present the next word to the user: is it when they hit space, or is it after a discrete number of words has been entered?
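
A minimal sketch of that backoff lookup, assuming the n-gram tables are data frames shaped like bigramdf, trigramdf, and quadgramdf above; the helper names and the regex matching are illustrative, and the real app would use full frequency tables rather than the top-10 frames built for the plots:

# hypothetical helper: most frequent continuation of `phrase` in an n-gram data frame
# whose ngram column holds space-separated terms and occurrences holds counts
predict_next <- function(phrase, ngram_df) {
  hits <- ngram_df[grepl(paste0("^", phrase, " "), ngram_df$ngram), ]   # n-grams starting with the typed phrase
  if (nrow(hits) == 0) return(NA_character_)                            # no match: caller backs off
  best <- as.character(hits$ngram[which.max(hits$occurrences)])
  tail(strsplit(best, " ")[[1]], 1)                                     # last word of the best match
}

# back off from 4-grams to 3-grams to 2-grams
predict_word <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  tables <- list(quadgramdf, trigramdf, bigramdf)      # context lengths 3, 2, 1
  for (i in seq_along(tables)) {
    n <- 4 - i                                         # number of context words needed
    if (length(words) >= n) {
      guess <- predict_next(paste(tail(words, n), collapse = " "), tables[[i]])
      if (!is.na(guess)) return(guess)
    }
  }
  NA_character_                                        # fall back to no prediction
}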

ggplot(quadgramdf, aes(x=reorder(ngram, -occurrences), y=occurrences)) + 
        geom_bar(fill="firebrick3", color="darkred", stat="identity") + theme_bw() + ylab("Occurences") + xlab("4 Word N-Grams")+
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot of chunk Plot 4NGRAMS (top 10 four-word n-grams)