Milestone Report - My JHU Data Science Specialization Capstone Project

Joshua J

March 26 2015

The goal of the Data Science Specialization Capstone Project is to apply everything we have learned in the program to produce a predictive text algorithm in R. A Shiny application will be developed to demonstrate next-word prediction: the user enters a word or sentence, and the system suggests the most likely next word for the user to choose. Implementing this draws on linguistics, statistics, programming, and natural language processing.

This milestone report covers the following:

Demonstrate that the data have been downloaded and successfully loaded.
Display basic summary statistics (line counts) for the data sets.
Show a chart of interesting findings from the data loaded so far.
Outline my next steps for creating the prediction algorithm and Shiny app.

Load Datasets and Conduct Exploratory Analysis

Loading the required packages in RStudio software to perform the analysis

setwd("~/My_Projects/JohnsHDS/10_Capstone_Project_03-2015")
library("tm")      # Text mining and corpus handling
library("NLP")     # Generic NLP infrastructure
library("openNLP") # Interface to the Apache OpenNLP tools
library("ggplot2") # Charting
library("RWeka")   # N-gram tokenization
library("qdap")    # Sentence detection (sent_detect)

Loading the datasets provided by Coursera & SwiftKey, which are available here

# Read the first 3 lines of each file to confirm that the data load correctly
usTwitter <- readLines("final/en_us/en_US.twitter.txt", 3)
usNews <- readLines("final/en_us/en_US.news.txt", 3)
usBlogs <- readLines("final/en_us/en_US.blogs.txt", 3)

Making a Small Dataset

Because of my laptop's limited capacity, I work with a small sample: the first 4,000 lines of each file.

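# Passing the path (not an open connection) lets readLines close the file itself
tinyT <- readLines("final/en_us/en_US.twitter.txt", 4000)
tinyN <- readLines("final/en_us/en_US.news.txt", 4000)
tinyB <- readLines("final/en_us/en_US.blogs.txt", 4000)
# Concatenate the samples; paste() would glue line i of each source into one string
tiny <- c(tinyT, tinyN, tinyB)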

# make input text lines into sentences
tiny <- sent_detect(tiny, language = "en", model = NULL)

Tokenization and Filtering Profanity Words

Building a corpus, then removing numbers, extra whitespace, and punctuation, and converting everything to lowercase.

corpus <- VCorpus(VectorSource(tiny)) # Building the main corpus
corpus <- tm_map(corpus, removeNumbers) # removing numbers
corpus <- tm_map(corpus, stripWhitespace) # removing whitespaces
corpus <- tm_map(corpus, removePunctuation) # removing special characters
corpus <- tm_map(corpus, content_transformer(tolower)) #lowercasing all contents

Removing the Profanity Words (a list from Google)

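# removeWords() expects a character vector of words, so read the list directly;
# it is lowercased because the corpus was lowercased above
badwordsvector <- tolower(readLines("ProfanityWords.txt"))
corpus <- tm_map(corpus, removeWords, badwordsvector, lazy=TRUE)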

Converting Corpus to Data Frame for processing by the RWeka functions

# data.frame() has no lazy argument; passing lazy=TRUE would add a column named "lazy"
cleantext <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)

Using the RWeka package to generate the one-gram, bi-gram and tri-gram sets for further analysis

# Tokenize the text column itself rather than the whole data frame
onetoken <- NGramTokenizer(cleantext$text, Weka_control(min = 1, max = 1))
bitoken <- NGramTokenizer(cleantext$text, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritoken <- NGramTokenizer(cleantext$text, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
bitritoken <- paste(tritoken,bitoken)

Summarize the Datasets

system("wc -l final/en_us/en_US.twitter.txt") # Number of records in the file

2360148 final/en_us/en_US.twitter.txt

system("wc -l final/en_us/en_US.news.txt") # Number of records in the file

1010242 final/en_us/en_US.news.txt

system("wc -l final/en_us/en_US.blogs.txt") # Number of records in the file

899288 final/en_us/en_US.blogs.txt
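
The same kind of summary can be computed inside R, which also works on systems without `wc`. The sketch below is only an illustration and not part of the original analysis: `summarize_lines()` is a hypothetical helper, and it is shown on a small made-up vector rather than on the full files.

```r
# Hypothetical helper: basic summary statistics for a character vector of lines
summarize_lines <- function(lines) {
  # Split each line on runs of whitespace and count the non-empty pieces
  words_per_line <- vapply(
    strsplit(trimws(lines), "\\s+"),
    function(w) sum(nzchar(w)),
    integer(1)
  )
  data.frame(
    lines     = length(lines),        # number of records
    words     = sum(words_per_line),  # total word count
    max_chars = max(nchar(lines)),    # length of the longest line
    stringsAsFactors = FALSE
  )
}

# Small made-up sample; for a real file, pass the result of readLines(path)
sample_lines <- c("How are you?", "We love you Mr. Brown.", "")
stats <- summarize_lines(sample_lines)
print(stats)
```

Applied to `tinyT`, `tinyN` and `tinyB`, this would give one row per source that can be combined with `rbind()` into a small summary table.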

Display Sample Data

The first three lines of the Twitter data:

readLines(file("final/en_us/en_US.twitter.txt","r"), 3)

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."

The first three lines of the News data:

readLines(file("final/en_us/en_US.news.txt","r"), 3)

## [1] "He wasn't home alone, apparently."                                                                                                                                                
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                        
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."

The first three lines of the Blogs data:

readLines(file("final/en_us/en_US.blogs.txt","r"), 3)

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."

Chart Features of the Data

Calculate the word frequencies for One-gram, Bi-grams and Tri-grams.

one <- data.frame(table(onetoken))
two <- data.frame(table(bitoken))
tri <- data.frame(table(tritoken))
onesorted <- one[order(one$Freq,decreasing = TRUE),]
twosorted <- two[order(two$Freq,decreasing = TRUE),]
trisorted <- tri[order(tri$Freq,decreasing = TRUE),]

one15 <- onesorted[1:15,]
colnames(one15) <- c("Word","Frequency")

two15 <- twosorted[1:15,]
colnames(two15) <- c("Word","Frequency")

#tri20 <- trisorted[1:15,]
#colnames(tri20) <- c("Word","Frequency")

Chart the top 15 Single words (sorted alphabetically)

ggplot(one15, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="black") + geom_text(aes(label=Frequency), vjust=-0.4) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
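
Because `Word` is a factor, ggplot2 places the bars in the order of its levels, which is alphabetical here. If a frequency-ordered chart is preferred, one option is to reorder the factor levels before plotting. A minimal sketch with made-up counts (the data frame `demo15` and its values are illustrative only, not results from the corpus):

```r
# Made-up frequencies standing in for one15; not real corpus results
demo15 <- data.frame(
  Word      = factor(c("and", "the", "to")),
  Frequency = c(120L, 300L, 150L)
)

# Reorder the factor levels so the most frequent word comes first
demo15$Word <- reorder(demo15$Word, -demo15$Frequency)
levels(demo15$Word)

# The reordered data frame can then go through the same kind of ggplot() call, e.g.
# ggplot(demo15, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity")
```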

The Next Step

The next step is to develop a prediction model and a front-end application that demonstrates text mining by predicting the next word from the user’s input.

The front-end application will be built with R Shiny. In the Shiny app UI, a sidebar panel on the left takes the user’s input, and the main panel displays the word predicted by the model. An end user types a word or a simple sentence into the text box on the left; the predicted next word is displayed in the main panel on the right-hand side.
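
This layout maps naturally onto Shiny's `sidebarLayout()`. A minimal sketch, assuming the shiny package is installed; the IDs `user_text` and `prediction` and the `predict_next_word()` stub are placeholders chosen for illustration, not the final application:

```r
library(shiny)

# Placeholder for the real model (assumption): simply echoes the last word typed
predict_next_word <- function(text) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0L) "" else words[length(words)]
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    # Left: text box for the user's word or sentence
    sidebarPanel(textInput("user_text", "Enter a word or sentence:")),
    # Right: the predicted next word
    mainPanel(textOutput("prediction"))
  )
)

server <- function(input, output) {
  # Re-runs whenever the text box changes
  output$prediction <- renderText(predict_next_word(input$user_text))
}

# Build the app object; start it interactively with runApp(app)
app <- shinyApp(ui = ui, server = server)
```

Swapping the stub for the real model would leave the UI and server wiring unchanged.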

On the Shiny app server side, n-grams (n = 2, 3, 4) are generated for use in the next-word prediction model. The best match is used to predict the word that follows the input n-gram. My implementation plan is to continue leveraging the RWeka text mining library and other relevant R packages. If time allows, I will also explore Python NGram packages.
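
One simple way to pick a "best match" is a back-off lookup: try the longest available context first and fall back to a shorter one when nothing matches. The sketch below illustrates the idea under assumptions and is not the final model: `predict_next()` and the toy `ngrams` table are hypothetical, the counts are made up, and only one- and two-word contexts are covered.

```r
# Toy n-gram table with made-up counts: 'context' holds the preceding word(s),
# 'nextword' the word that followed, 'freq' how often that was observed
ngrams <- data.frame(
  context  = c("thanks for", "thanks for", "for", "for", "you"),
  nextword = c("the", "your", "the", "a", "are"),
  freq     = c(8L, 3L, 20L, 12L, 5L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the most frequent next word for the longest
# matching context, backing off from two words to one; NA if nothing matches
predict_next <- function(input, ngram_table, max_context = 2L) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0L) return(NA_character_)
  for (n in seq(min(max_context, length(words)), 1L)) {
    context <- paste(tail(words, n), collapse = " ")
    hits <- ngram_table[ngram_table$context == context, , drop = FALSE]
    if (nrow(hits) > 0L) {
      return(hits$nextword[which.max(hits$freq)])
    }
  }
  NA_character_
}

predict_next("Thanks for", ngrams)   # matched on the two-word context
predict_next("waiting for", ngrams)  # backs off to the one-word context "for"
predict_next("zebra", ngrams)        # no match at any level
```

A table of this shape could be derived from the sorted frequency data frames by splitting each n-gram into its leading words and its final word.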