As the world becomes more open and globalized, the need to communicate across different channels has grown rapidly. Touch-screen input is increasingly the standard, yet typing verbose text on it remains a challenge. Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages; as such, it is closely related to human-computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, while others involve natural language generation. NLP has many sub-areas of focus, as described on the Wikipedia page https://en.wikipedia.org/wiki/Natural_language_processing, all with the same end goal: enabling computers to process information much as a human being would. Achieving this requires knowledge of linguistics, statistics, and programming. The end goal of the Data Science Specialization Capstone Project is to produce a predictive text algorithm in R that, based on a user's text input, suggests the next most likely word to be entered.
Loading the required packages in RStudio to perform the analysis:
library("NLP") #Generics NLP Function set
library("openNLP") #Generics NLP Function set
library("tm") #For Text Mining & Corpus workings
library("RWeka") #For n-gram vector generation
library("qdap") #For Text Mining & Corpus workings
library("ggplot2") #Charting functionality
Checking that the three English datasets provided by Coursera & SwiftKey can be read (only the first few lines are read in here; samples of the full files are taken further below).
tus <- readLines("final/en_US/en_US.twitter.txt", 3) # first 3 lines of the Twitter file
nus <- readLines("final/en_US/en_US.news.txt", 3)    # first 3 lines of the news file
bus <- readLines("final/en_US/en_US.blogs.txt", 3)   # first 3 lines of the blogs file
system("wc -l final/en_US/en_US.twitter.txt") # Number of lines in the file
2360148 final/en_US/en_US.twitter.txt
Preview of the data layout
readLines(file("final/en_US/en_US.twitter.txt","r"), 4)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
system("wc -l final/en_US/en_US.blogs.txt") # Number of lines in the file
899288 final/en_US/en_US.blogs.txt
Preview of the data layout
readLines(file("final/en_US/en_US.blogs.txt","r"), 4)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan âgodsâ."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
system("wc -l final/en_US/en_US.news.txt") # Number of lines in the file
1010242 final/en_US/en_US.news.txt
Preview of the data layout
readLines(file("final/en_US/en_US.news.txt","r"), 4)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
Only a portion of each dataset will be used, so that tokenization and the exploratory analysis run faster. The methodology is identical when the input is taken from the full files.
tiny1 <- readLines(file("final/en_US/en_US.twitter.txt","r"), 5000) # first 5,000 Twitter lines
tiny2 <- readLines(file("final/en_US/en_US.blogs.txt","r"), 5000)   # first 5,000 blog lines
tiny3 <- readLines(file("final/en_US/en_US.news.txt","r"), 5000)    # first 5,000 news lines
tiny <- paste(tiny1, tiny2, tiny3) # merge the samples (paste joins the i-th line from each source into one string)
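Taking the first 5,000 lines keeps the report fast, but it may bias the sample toward whatever sits at the top of each file. A random sample would be more representative; below is a minimal sketch of that idea, assuming a hypothetical 1% sampling fraction (the helper name sampleLines is not part of the analysis above). Note that it reads the full files, so it is slower than the head-of-file approach.
set.seed(1234) # for reproducibility
sampleLines <- function(path, fraction = 0.01) {
  allLines <- readLines(path, warn = FALSE, skipNul = TRUE)
  allLines[sample.int(length(allLines), round(length(allLines) * fraction))] # keep ~1% of the lines
}
tinyRandom <- c(sampleLines("final/en_US/en_US.twitter.txt"),
                sampleLines("final/en_US/en_US.blogs.txt"),
                sampleLines("final/en_US/en_US.news.txt"))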
The next step is to identify appropriate tokens such as words, punctuation, and numbers, and to remove profanity and other words we do not want to predict. The data was cleaned using the tm package. For the profanity filter I downloaded a list of bad words from https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en
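The list only needs to be fetched once. A minimal sketch, assuming the raw-file URL that corresponds to the GitHub page above and the local path final/bad_words.txt used later in this report:
# Download the profanity list if it is not already present
badWordsURL  <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
badWordsFile <- "final/bad_words.txt"
if (!file.exists(badWordsFile)) {
  download.file(badWordsURL, destfile = badWordsFile)
}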
Breaking the loaded text lines into sentences, because when building bi-grams we do not want the last word of one sentence and the first word of the next to come up as a potential prediction.
tiny <- sent_detect(tiny, language = "en", model = NULL)
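As a quick illustration of what this step does (the exact splitting behaviour depends on the qdap version), a toy example:
# sent_detect splits a character vector on sentence boundaries
sent_detect("This is the first sentence. And here is the second one.")
# expected result: two separate sentences, so "sentence And" never becomes a bi-gram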
Building the main corpus, then removing numbers, stripping extra whitespace, lowercasing all content, and removing punctuation:
corpus <- VCorpus(VectorSource(tiny)) # Building the main corpus
corpus <- tm_map(corpus, removeNumbers) # removing numbers
corpus <- tm_map(corpus, stripWhitespace) # removing whitespaces
corpus <- tm_map(corpus, content_transformer(tolower)) #lowercasing all contents
corpus <- tm_map(corpus, removePunctuation) # removing punctuation
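The blogs preview above also shows URLs and garbled non-ASCII characters surviving into the text. A minimal, optional sketch of two extra transformations that could be added to the same tm pipeline; the regular expressions are illustrative rather than exhaustive:
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x, perl = TRUE))
corpus <- tm_map(corpus, toSpace, "http\\S*")       # drop URLs
corpus <- tm_map(corpus, toSpace, "[^\\x20-\\x7E]") # drop non-ASCII characters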
Removing the profanity words
badWordsVector <- readLines("final/bad_words.txt", warn = FALSE) # removeWords expects a plain character vector of words
corpus <- tm_map(corpus, removeWords, badWordsVector)
Converting Corpus to Data Frame for processing by the RWeka functions
cleanText<-data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)
Using the RWeka package to generate single-word tokens, bi-gram sets, and tri-gram sets for further exploratory analysis, keeping each in a separate vector for now.
oneToken <- NGramTokenizer(cleanText, Weka_control(min = 1, max = 1))
biToken <- NGramTokenizer(cleanText, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
triToken <- NGramTokenizer(cleanText, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
bitriToken <- c(biToken, triToken) # combined bi- and tri-gram vector
Since some words are more frequent than others, I'm analyzing the distribution of word frequencies, and doing the same for the 2-grams and 3-grams in the dataset.
Distribution of word frequencies for single-word, two-word, and three-word combinations: preparing the data in the correct format by transforming the n-grams into data frames and ordering them by frequency for charting.
one <- data.frame(table(oneToken))
two <- data.frame(table(biToken))
tri <- data.frame(table(triToken))
oneSorted <- one[order(one$Freq,decreasing = TRUE),]
twoSorted <- two[order(two$Freq,decreasing = TRUE),]
triSorted <- tri[order(tri$Freq,decreasing = TRUE),]
one20 <- oneSorted[1:20,]
colnames(one20) <- c("Word","Frequency")
two20 <- twoSorted[1:20,]
colnames(two20) <- c("Word","Frequency")
tri20 <- triSorted[1:20,]
colnames(tri20) <- c("Word","Frequency")
ggplot(one20, aes(x=reorder(Word, -Frequency), y=Frequency)) + geom_bar(stat="identity", fill="blue") + geom_text(aes(label=Frequency), vjust=-0.2) + xlab("Word")
ggplot(two20, aes(x=reorder(Word, -Frequency), y=Frequency)) + geom_bar(stat="identity", fill="green") + geom_text(aes(label=Frequency), vjust=-0.2) + xlab("Bi-gram") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(tri20, aes(x=reorder(Word, -Frequency), y=Frequency)) + geom_bar(stat="identity", fill="red") + geom_text(aes(label=Frequency), vjust=-0.2) + xlab("Tri-gram") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The next question that pops up is how many words appear only once. I'm doing this only for the single-word set, where the count is most meaningful, rather than for the bi-gram and tri-gram sets.
oneUnique <- oneSorted[oneSorted$Freq == 1,]
nrow(oneUnique) # number of words that appear only once in the text
## [1] 7592
totalWordsUsed <- sum(oneSorted$Freq) # total words in the text
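A quick check of how large that singleton group is, both as a share of the vocabulary and as a share of all word occurrences (the exact figures depend on the sample read above):
nrow(oneUnique) / nrow(oneSorted)    # share of the vocabulary occurring only once
sum(oneUnique$Freq) / totalWordsUsed # share of all word occurrences contributed by those words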
The next question is how many unique words from a frequency-sorted dictionary are needed to cover a given percentage of all word instances in the text. To answer it, I've generated a progressive series of coverage targets from 10% up to 90% and computed the number of words needed for each.
# Function returning the number of unique (most frequent) words needed to cover a given fraction of the text
wordspct <- function(targetpct) {
    totalWordsUsed <- sum(oneSorted$Freq)
    pctneeded <- 0
    runningSum <- 0
    i <- 0
    while (pctneeded < targetpct) {
        i <- i + 1
        runningSum <- runningSum + oneSorted$Freq[i]
        pctneeded <- runningSum / totalWordsUsed
    }
    return(i) # number of words consumed to reach the target coverage
}
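The same answer can also be obtained without an explicit loop, using the cumulative sum of the sorted frequencies; a minimal equivalent sketch (the name wordspct_vec is only used here):
# Vectorised equivalent: index of the first word at which cumulative coverage reaches the target
wordspct_vec <- function(targetpct) {
  coverage <- cumsum(oneSorted$Freq) / sum(oneSorted$Freq)
  which(coverage >= targetpct)[1]
}
wordspct_vec(0.5) # unique words needed to cover 50% of all word instances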
Generating the data with the function above to plot the number of words needed to cover N% of the text.
# Number of unique words required for n% text coverage
pcts <- c(10,20,30,40,50,60,70,80,90)
wordsNeeded <- sapply(pcts / 100, wordspct) # compute once per target, without overwriting the wordspct() function
qplot(pcts, wordsNeeded, geom=c("line","point")) + geom_text(aes(label=wordsNeeded), hjust=1.3, vjust=-0.2) + scale_x_continuous(breaks=pcts)
The takeaway from the chart above is that each additional 10% of text coverage requires roughly a doubling of the number of unique words.
A few ideas for increasing coverage, handling words that may not be in the corpora, and covering the same number of phrases with a smaller dictionary:
To increase coverage for a particular individual, scan the person's own writing (tweets, SMS, Facebook posts and comments, emails, etc.) and assign higher weights to the n-grams found there. The current date could also be checked against upcoming holidays so that words associated with them are suggested and weighted more heavily:
Example: Halloween - scary, candy, etc.
Geo-awareness: if the user is at a U2 concert, words related to the event (or drawn from calendar entries) could be suggested. These methods temporarily extend the corpora, irrespective of what they already contain, and improve performance compared with maintaining the whole dataset constantly. In addition, stemming would reduce the vocabulary to the common roots of most words, cutting both the number of words and the processing time.
The biggest trade-off is the amount of data analyzed versus processing time. Many stop words occur with very high frequency; I've intentionally kept them because I want to be able to predict them. The most popular bi-grams and tri-grams shown above contain nothing out of the ordinary.
Modelling and Prediction to be completed next week.
The planned deliverable is a hosted Shiny app, running in a web browser, that provides 4 relevance-sorted suggestions for the next most likely word and, if it does not increase response time too much, 3 additional suggestions for the next two most likely words. The interface will be as basic as possible: a text input field on the left with the respective predictions, clickable, on the right. I'll make a few color changes from the standard templates, and a description of the product will also be included.
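To illustrate the direction the prediction model will take, here is a minimal, hedged sketch of a naive next-word lookup built from the bi-gram frequency table computed above. It is not the final algorithm (which will need back-off to lower-order n-grams and smoothing), and the helper name predictNext is purely illustrative.
# Split each bi-gram into its first and second word, keeping the observed frequencies
biFreq <- two # re-use the bi-gram frequency table built earlier (columns biToken and Freq)
parts  <- strsplit(as.character(biFreq$biToken), " ")
biFreq$first  <- sapply(parts, `[`, 1)
biFreq$second <- sapply(parts, `[`, 2)

# Return the n most frequent followers of the last word typed
predictNext <- function(lastWord, n = 4) {
  candidates <- biFreq[biFreq$first == tolower(lastWord), ]
  candidates <- candidates[order(candidates$Freq, decreasing = TRUE), ]
  head(as.character(candidates$second), n)
}
predictNext("in") # e.g. the four words most often seen after "in" in this sample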