Joshua J
March 26 2015
The goal of the Data Science Specialization Capstone Project is to apply everything we have learned in the program to produce a predictive text algorithm in R. A Shiny application will be developed to demonstrate next-word prediction: a user provides a word or sentence, and the system suggests the most likely next word for the user to choose. Implementing this requires knowledge of linguistics, statistics, programming, and natural language processing.
This milestone report covers loading and sampling the datasets, building and cleaning a corpus, tokenizing it into unigrams, bigrams, and trigrams, exploring word frequencies, and outlining plans for the prediction application.
setwd("~/My_Projects/JohnsHDS/10_Capstone_Project_03-2015")
library("tm") #For Text Mining & Corpus workings
library("NLP") #Generics NLP Function set
library("openNLP") #Generics NLP Function set
library("ggplot2") #Charting functionality
library("RWeka") #For n-gram vector generation
library("qdap") #For Text Mining & Corpus workings
Loading the datasets provided by Coursera & SwiftKey, which are available here
usTwitter <- readLines("final/en_us/en_US.twitter.txt", 3) # peek at the first 3 lines of each file
usNews <- readLines("final/en_us/en_US.news.txt", 3)
usBlogs <- readLines("final/en_us/en_US.blogs.txt", 3)
Due to my laptop's limited capacity, I work with a small sample of each dataset.
tinyT <- readLines("final/en_us/en_US.twitter.txt", 4000)
tinyN <- readLines("final/en_us/en_US.news.txt", 4000)
tinyB <- readLines("final/en_us/en_US.blogs.txt", 4000)
tiny <- c(tinyT, tinyN, tinyB) # combine the three samples into one vector (paste would glue unrelated lines together)
# make input text lines into sentences
tiny <- sent_detect(tiny, language = "en", model = NULL)
Building a corpus and cleaning it by removing numbers, extra whitespace, and punctuation, and then converting all text to lowercase.
corpus <- VCorpus(VectorSource(tiny)) # Building the main corpus
corpus <- tm_map(corpus, removeNumbers) # removing numbers
corpus <- tm_map(corpus, stripWhitespace) # stripping extra whitespace
corpus <- tm_map(corpus, removePunctuation) # removing punctuation
corpus <- tm_map(corpus, content_transformer(tolower)) #lowercasing all contents
Removing profanity words (using a list from Google):
badwords <- readLines("ProfanityWords.txt") # removeWords expects a plain character vector, not a VectorSource
corpus <- tm_map(corpus, removeWords, badwords, lazy=TRUE)
Converting the corpus to a data frame for processing by the RWeka functions.
cleantext <- data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=FALSE) # flatten corpus contents into one text column
Using the RWeka package to build the unigram, bigram, and trigram sets for further analysis.
onetoken <- NGramTokenizer(cleantext$text, Weka_control(min = 1, max = 1))
bitoken <- NGramTokenizer(cleantext$text, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritoken <- NGramTokenizer(cleantext$text, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
bitritoken <- c(bitoken, tritoken) # combined bigram/trigram vector
system("wc -l final/en_us/en_US.twitter.txt") # Number of records in the file
2360148 final/en_us/en_US.twitter.txt
system("wc -l final/en_us/en_US.news.txt") # Number of records in the file
1010242 final/en_us/en_US.news.txt
system("wc -l final/en_us/en_US.blogs.txt") # Number of records in the file
899288 final/en_us/en_US.blogs.txt
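The wc calls above assume a Unix-like shell. As a minimal, platform-independent sketch, the same line counts (plus rough word counts) can be computed in R itself; the helper name countStats is my own, and it reads each file fully into memory, so it is slow on the complete datasets.
countStats <- function(path) {
  lines <- readLines(path, skipNul = TRUE) # read the whole file
  data.frame(file = path,
             lines = length(lines),
             words = sum(lengths(strsplit(lines, "\\s+")))) # rough whitespace-based word count
}
countStats("final/en_us/en_US.twitter.txt")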
The first three lines of the Twitter data:
readLines(file("final/en_us/en_US.twitter.txt","r"), 3)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
The first three lines of the News data:
readLines(file("final/en_us/en_US.news.txt","r"), 3)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
The first three lines of the Blogs data:
readLines(file("final/en_us/en_US.blogs.txt","r"), 3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
Calculate the word frequencies for unigrams, bigrams, and trigrams.
one <- data.frame(table(onetoken))
two <- data.frame(table(bitoken))
tri <- data.frame(table(tritoken))
onesorted <- one[order(one$Freq,decreasing = TRUE),]
twosorted <- two[order(two$Freq,decreasing = TRUE),]
trisorted <- tri[order(tri$Freq,decreasing = TRUE),]
one15 <- onesorted[1:15,]
colnames(one15) <- c("Word","Frequency")
two15 <- twosorted[1:15,]
colnames(two15) <- c("Word","Frequency")
#tri15 <- trisorted[1:15,]
#colnames(tri15) <- c("Word","Frequency")
Chart the top 15 single words (the x-axis is sorted alphabetically):
ggplot(one15, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="black") + geom_text(aes(label=Frequency), vjust=-0.4) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
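The same ggplot pattern can be reused on the top 15 bigrams already computed above, for example:
ggplot(two15, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="black") + geom_text(aes(label=Frequency), vjust=-0.4) + theme(axis.text.x = element_text(angle = 45, hjust = 1))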
The next step is to develop a prediction model and a front-end application that demonstrates a text-mining capability: predicting the next word based on the user's input.
The front-end application will be built with R Shiny. In the Shiny app UI, a sidebar panel on the left takes the user's input, and the main panel displays the word prediction from the model. An end user types a word or a short sentence into the text box on the left; the predicted next word is displayed in the main panel on the right-hand side.
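A minimal sketch of that layout is below; the widget IDs userText and nextWord are placeholders, not the final design.
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    sidebarPanel(textInput("userText", "Type a word or sentence:")), # left: user input
    mainPanel(textOutput("nextWord")) # right: predicted next word
  )
)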
On the Shiny app server side, n-grams (of order 2, 3, and 4) will be generated for use in the next-word prediction model. The best match against the highest-order n-gram is used to predict the word following the user's input; a rough sketch follows. My implementation plan is to continue leveraging the RWeka text-mining library and relevant R packages. If time allows, I will also explore Python n-gram packages.
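As an illustration of the best-match idea (not the final model), the sketch below backs off from the trigram table to the bigram table built earlier in this report. The helper predictNext is hypothetical, assumes the sorted data frames trisorted and twosorted from above, and ignores regex metacharacters in the input.
# Hypothetical sketch: highest-order-first backoff over the sorted n-gram tables.
predictNext <- function(input, trisorted, twosorted) {
  words <- tolower(unlist(strsplit(input, "\\s+")))
  n <- length(words)
  if (n == 0) return(NA_character_)
  if (n >= 2) { # try trigrams first: match "last two words + ?"
    key <- paste(words[n-1], words[n])
    hits <- trisorted[grepl(paste0("^", key, " "), trisorted$tritoken), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$tritoken[1])))
  }
  # back off to bigrams: match "last word + ?"
  hits <- twosorted[grepl(paste0("^", words[n], " "), twosorted$bitoken), ]
  if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$bitoken[1])))
  NA_character_ # no match in either table
}
predictNext("how are", trisorted, twosorted)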