B. McCracken - Tech Entrepreneur
This is the progress report for the Johns Hopkins University Data Science Specialization capstone course. The project demonstrates the use of Natural Language Processing tools to build a model that predicts the next word typed in a sentence. The project uses several language processing packages (loaded below); the tm package was the primary package used in this project.
The first step in the project is to prepare the environment. There are three steps: (1) load the needed packages, (2) download the data if needed, and (3) create a sample directory to hold the sample corpus.
packages<-function(x){ #function to determine whether a needed package is installed; install and load it if not
  x<-as.character(match.call()[[2]])
  if (!require(x,character.only=TRUE)){
    install.packages(pkgs=x,repos="http://cran.r-project.org")
    require(x,character.only=TRUE)
  }
}
packages(dplyr); packages(downloader); packages(NLP); packages(openNLP); packages(RWeka)
packages(stringr); packages(stringi); packages(tm); packages(quanteda); packages(ggplot2)
packages(gridExtra); remove(packages)
#Step 2. Download the data for the project. Test to see if it already exists first, since this is a 510 MB file
if (!file.exists("dataset.zip")) {
fileurl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download(fileurl, dest="dataset.zip", mode="wb")
unzip("dataset.zip", exdir = ".") # this command creates a directory "final" and extracts the data files into it
}
#Step 3. Test to see if the sample directory already exists (in case the code is run more than once)
if (!dir.exists("./sample")) {
dir.create("./sample")
}
The data set from SwiftKey allows for analysis of multiple languages: English, German, Russian, and Finnish. This project focuses on the English version. The entire corpus of three documents is 510 MB, which is too large to manipulate in memory. A sample of each document is selected and then written out to a sample directory.
#sample the twitter file
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)
twitter_s <- sample(twitter, length(twitter) * 0.001); remove(twitter)
#sample the blogs file
blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding="UTF-8", skipNul=TRUE)
blogs_s <- sample(blogs, length(blogs) * 0.001); remove(blogs)
#sample the news file
news <- readLines("./final/en_US/en_US.news.txt", encoding="UTF-8", skipNul=TRUE)
news_s <- sample(news, length(news) * 0.001); remove(news)
#write the files to a folder
writeLines(blogs_s, con = "./sample/en_US.blogs.txt", sep = "\n", useBytes = FALSE)
writeLines(twitter_s, con = "./sample/en_US.twitter.txt", sep = "\n", useBytes = FALSE)
writeLines(news_s, con = "./sample/en_US.news.txt", sep = "\n", useBytes = FALSE)
Analyze Corpus: The corpus has three files: a US news file, blog extracts, and a file of tweets. A sample of all three files was chosen for analysis to build a broader range of phrases and terms upon which to build an application dictionary. In cleaning the data, stemming and removing stop words produced somewhat meaningless words and phrases.
#================================
#Use tm to create a VCorpus of the documents in the sample directory, then convert it to a quanteda corpus
#================================
cname <- file.path("./sample/"); docs <- Corpus(DirSource(cname)); myCorpus <- corpus(docs)
#=======Create document feature matrix
myDfm <- dfm(myCorpus, ignoredFeatures = stopwords("english"), stem = FALSE) #remove punctuation, make lowercase, index
top10 <- topfeatures(myDfm, 10); top10df <- data.frame(ngram=names(top10), occurrences=top10); top10df <- arrange(top10df, desc(occurrences))
#======Cleaning the tm VCorpus before creating n-grams
toEmpty <- content_transformer(function(x, pattern) gsub(pattern, "", x))
docs <- tm_map(docs, toEmpty, "#\\w+"); docs <- tm_map(docs, removePunctuation); docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers); docs <- tm_map(docs, stripWhitespace)
summary(myCorpus); topfeatures(myDfm, 10)
Corpus consisting of 3 documents.

       Text Types Tokens Sentences                id language       datetimestamp
      text1  8162  42449      2029   en_US.blogs.txt       en 2016-08-29 02:28:37
      text2  1305   2827       120    en_US.news.txt       en 2016-08-29 02:28:37
      text3  7801  37650      2555 en_US.twitter.txt       en 2016-08-29 02:28:37

Source:  Converted from tm VCorpus 'docs'
Created: Sun Aug 28 19:28:37 2016
Notes:
just one u will like can get time day good
248 229 200 197 192 189 176 176 170 162
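For comparison, the stemmed version of the same document-feature matrix, which produced the somewhat meaningless stemmed terms mentioned above, can be rebuilt with the same arguments (a sketch only; this call is not part of the code above):
#Sketch: stemmed dfm for comparison with the unstemmed one above (assumes the same myCorpus object)
myDfmStemmed <- dfm(myCorpus, ignoredFeatures = stopwords("english"), stem = TRUE)
topfeatures(myDfmStemmed, 10)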
Create and Plot N-Grams: RWeka was used to create a tokenizer from which 2-, 3- and 4-word n-grams could be extracted. ggplot was used to plot the top 10 items of the 1- to 4-word n-grams. The three- and four-word n-grams start to look more like news stories than tweets, unlike the one- and two-word n-grams.
#=================NGRAM_Tokenizer====================================
ngram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n_gram, max = n_gram)) #tokenizer uses the global n_gram value set before each call
#===========2-word ngrams =============
n_gram =2; bigramtdm <- TermDocumentMatrix(docs, control = list(tokenize = ngram_tokenizer))
bigramtdm.matrix <- as.matrix(bigramtdm); bitopwords <- rowSums(bigramtdm.matrix)
bitop10 <- head(sort(bitopwords, decreasing = TRUE),10); bigramdf <- data.frame(ngram=names(bitop10), occurrences=bitop10)
bigramdf <- arrange(bigramdf, desc(occurrences))
#===========3-word ngrams =============
n_gram =3; trigramtdm <- TermDocumentMatrix(docs, control = list(tokenize = ngram_tokenizer))
trigramtdm.matrix <- as.matrix(trigramtdm); tritopwords <- rowSums(trigramtdm.matrix)
tritop10 <-head(sort(tritopwords, decreasing = TRUE),10); trigramdf <- data.frame(ngram=names(tritop10), occurrences=tritop10)
trigramdf <- arrange(trigramdf, desc(occurrences))
#===========4-word ngrams =============
n_gram =4; quadgramtdm <- TermDocumentMatrix(docs, control = list(tokenize = ngram_tokenizer))
quadgramtdm.matrix <- as.matrix(quadgramtdm); quadtopwords <- rowSums(quadgramtdm.matrix)
quadtop10 <- head(sort(quadtopwords, decreasing = TRUE),10); quadgramdf <- data.frame(ngram=names(quadtop10), occurrences=quadtop10)
quadgramdf <- arrange(quadgramdf, desc(occurrences))
#================================Plotting Ngrams - setting up for gridExtra
g1<- ggplot(top10df, aes(x=reorder(ngram, -occurrences), y=occurrences))
graph1a<- g1 + geom_bar(fill="grey", color="black", stat="identity") + theme_bw() + ylab("Occurrences") + xlab("Top Words In Corpus")
graph1b<- graph1a + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g2<- ggplot(bigramdf, aes(x=reorder(ngram, -occurrences), y=occurrences))
graph2a<- g2 + geom_bar(fill="steelblue1", color="blue", stat="identity") + theme_bw() + ylab("Occurrences") + xlab("2 Word N-Grams")
graph2b<- graph2a + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g3<- ggplot(trigramdf, aes(x=reorder(ngram, -occurrences), y=occurrences))
graph3a<- g3 + geom_bar(fill="gold2", color="darkgoldenrod4", stat="identity") + theme_bw() + ylab("Occurrences") + xlab("3 Word N-Grams")
graph3b<- graph3a + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g4<- ggplot(quadgramdf, aes(x=reorder(ngram, -occurrences), y=occurrences))
graph4a<- g4 + geom_bar(fill="firebrick3", color="darkred", stat="identity") + theme_bw() + ylab("Occurrences") + xlab("4 Word N-Grams")
graph4b<- graph4a + theme(axis.text.x = element_text(angle = 90, hjust = 1))
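The gridExtra call that combines the four panels is not shown above; a minimal sketch, assuming the graph1b through graph4b objects built above, is:
#Sketch: combine the four n-gram bar charts into one 2 x 2 figure (the original gridExtra call is not shown)
grid.arrange(graph1b, graph2b, graph3b, graph4b, ncol = 2)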
ggplot(bigramdf, aes(x=reorder(ngram, -occurrences), y=occurrences)) +
geom_bar(fill="steelblue1", color="blue", stat="identity") + theme_bw() + ylab("Occurences") + xlab("2 Word N-Grams") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Overall, memory limitations in R prevent using the entire corpus for analysis. As a result, I have tried to be very efficient in the use of data to achieve the objective. It has been very useful to test the capabilities of the tm, quanteda, and RWeka packages. It was cool to plot a "word cloud." I had great difficulty with gridExtra, as my graphs only took up a quarter of the page. This is why they are so small. Sorry about that.
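The word-cloud code itself is not included above; a minimal sketch of one way it could be drawn from the document-feature matrix, assuming the wordcloud and RColorBrewer packages (which are not loaded above), is:
#Sketch: word cloud from the top dfm features (assumes the wordcloud and RColorBrewer packages)
library(wordcloud); library(RColorBrewer)
top100 <- topfeatures(myDfm, 100)
wordcloud(words = names(top100), freq = top100, min.freq = 10, colors = brewer.pal(8, "Dark2"))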
The basic methodology for the n-gram text prediction model I will create is as follows:
Read a corpus and determine the optimal sample size to create a "dictionary" of terms and phrases to drive the app. The development will involve generating one-, two-, three- and four-word n-grams. A four-word n-gram will be used if three words are entered and we are predicting the fourth, a three-word n-gram if two words are entered and we are predicting the third, and so on (a rough sketch of this backoff lookup follows below). The final decision will be determining how and when to present the next word to the user: is it when they hit space, or after a discrete number of words has been entered?
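The sketch below is only an illustration of the intended backoff lookup (the model has not been built yet); predict_next is a hypothetical helper, and it assumes frequency tables shaped like bigramdf, trigramdf and quadgramdf above, but covering all n-grams rather than only the top 10 rows kept for plotting.
#Sketch of an n-gram backoff lookup; predict_next is a hypothetical helper, not part of the code above
predict_next <- function(phrase, quadgramdf, trigramdf, bigramdf) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 3)   #keep the last three words typed
  tables <- list(quadgramdf, trigramdf, bigramdf)               #longest context first
  for (i in seq_along(tables)) {
    ctx_len <- 4 - i                                            #3, 2, then 1 context words
    if (length(words) < ctx_len) next                           #back off if too few words entered
    context <- paste(tail(words, ctx_len), collapse = " ")
    ngrams <- as.character(tables[[i]]$ngram)
    hits <- tables[[i]][startsWith(ngrams, paste0(context, " ")), ]
    if (nrow(hits) > 0) {                                       #return last word of the most frequent match
      best <- as.character(hits$ngram[which.max(hits$occurrences)])
      return(tail(unlist(strsplit(best, " ")), 1))
    }
  }
  NA_character_                                                 #no match at any n-gram order
}
For example, predict_next("thanks for the", quadgramdf, trigramdf, bigramdf) would return the last word of the most frequent four-word n-gram beginning with "thanks for the", backing off to shorter n-grams if no match is found.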
ggplot(quadgramdf, aes(x=reorder(ngram, -occurrences), y=occurrences)) +
geom_bar(fill="firebrick3", color="darkred", stat="identity") + theme_bw() + ylab("Occurences") + xlab("4 Word N-Grams")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))