Natural Language Processing - Text Mining and Text Prediction: Exploratory Analysis on Swiftkey Datasets

Synopsis

The goal of the project is to build a model that can predict the next word given an input word or sentence fragment. This report examines the three sets of writing samples and performs some exploratory analysis on them. Some 1-gram (single word) to 3-gram (three-word phrase) models are briefly examined on samples of the datasets. As a next step, a 1-gram to n-gram model built from all of the text data will be used to predict the next word when a phrase is entered.

Data Processing

Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. To facilitate typing on mobile devices, SwiftKey, our corporate partner in this capstone project, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models.

In this capstone project we apply natural language processing (NLP), text mining, and the associated tools in R for exploratory data analysis, as well as for the subsequent text modelling and prediction.

About the data

The datasets for this project can be downloaded from the following site and unzipped into your working directory: Coursera-SwiftKey Datasets
The data is originally from: HC Corpora

In this report, we focus on the files that contain English text: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.

Data downloading and unzipping

  • Global settings: echo code and show results for all analyses
library(knitr)
opts_chunk$set(echo = TRUE, results = "show")
  • Downloading the dataset
# setwd("D:/2014_Coursera/DS-10_Capstone Project/")
# url  <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# if (!file.exists("coursera-swiftkey.zip")){
#   download.file(url, destfile="coursera-swiftkey.zip")
# }
  • Unzipping the dataset
# theFileList=c("final/en_US/en_US.twitter.txt", "final/en_US/en_US.news.txt","final/en_US/en_US.blogs.txt") 
# unzip("coursera-swiftkey.zip", files = theFileList, exdir="en_US", overwrite=TRUE, junkpaths=TRUE)

Data acquisition and basic evaluation

After downloading and unzipping the data, we first do a basic analysis of the raw files to get an idea of the size of the texts, using the readLines(), scan() and countLines() functions to read in the data.

setwd("D:/2014_Coursera/DS-10_Capstone Project/en_US")
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v1.34.0 (2014-10-07) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(stringr)
  • Number of lines in en_US.blogs.txt
Blogs<-"./en_US.blogs.txt" 
BlogData <- scan(Blogs, character(0), sep = "\n") # separate each line
line_blogs<-as.numeric(countLines(Blogs))
line_blogs
## [1] 899288
  • Number of lines in en_US.news.txt
News <-"./en_US.news.txt"
NewsData <-scan(News, character(0), sep = "\n") # separate each line
Line_news <-as.numeric(countLines(News))
Line_news
## [1] 1010242
  • Number of lines in en_US.twitter.txt
Tw <-"./en_US.twitter.txt"
TWData <-scan(Tw, character(0), sep = "\n") # separate each line
## Warning: embedded nul(s) found in input
Line_tw <-as.numeric(countLines(Tw))
Line_tw
## [1] 2360148

As we can see above, the total numbers of lines in en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt are 899288, 1010242 and 2360148, respectively.
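
For additional context on the size of the texts, the on-disk file sizes and rough word counts can also be checked. The sketch below reuses the file paths and the BlogData/NewsData/TWData vectors read in above; the helper names and the printed data frame are illustrative only.

# Rough size check: file size in MB and approximate word counts (whitespace-separated tokens)
theFiles <- c(Blogs, News, Tw)
sizeMB <- round(file.info(theFiles)$size / 1024^2, 1)
wordCount <- sapply(list(BlogData, NewsData, TWData),
                    function(x) sum(str_count(x, "\\S+")))
data.frame(file = basename(theFiles), sizeMB = sizeMB, words = wordCount)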

Data sampling

From the size evaluation of the three files above, we can see that they are fairly large. Given the limited computer memory and the processing time required, we build the exploratory analysis on a small subset of each file: roughly 0.1% of the lines (the line count divided by 1000) is read from the start of each file, as shown below. This should still give a good estimate of the frequently used words. A randomly drawn sample would be an alternative; a sketch is given after the code block below.

setwd("D:/2014_Coursera/DS-10_Capstone Project/en_US")
conblog <- file("./en_US.blogs.txt", "r") 
DataBloga <-readLines(conblog, (line_blogs/1000),encoding="latin1")
writeLines(DataBloga, con="databloga.txt", "\n")
close(conblog)

conbnews <- file("./en_US.news.txt", "r") 
DataNewsa<-readLines(conbnews, (Line_news/1000),encoding="latin1")
writeLines(DataNewsa, con="datanewsa.txt", "\n")
close(conbnews)

conbtw<-file("./en_US.twitter.txt", "r")
DataTwa<-readLines(conbtw, (Line_tw/1000),encoding="latin1")
writeLines(DataTwa, con="datatwa.txt", "\n")
close(conbtw)
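
Note that readLines() with an n argument takes the first lines of each file rather than a random selection. If a random sample is preferred, one option is to flag lines with rbinom(), as in the sketch below (the sampleLines() helper and the Data*Rnd names are illustrative and not used elsewhere in this report):

# Randomly keep ~0.1% of the lines read in with scan() above, instead of taking the first lines
set.seed(1234)
sampleLines <- function(lines, rate = 0.001) {
  keep <- rbinom(length(lines), size = 1, prob = rate) == 1
  lines[keep]
}
DataBlogRnd <- sampleLines(BlogData)
DataNewsRnd <- sampleLines(NewsData)
DataTwRnd   <- sampleLines(TWData)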

Data processing - Tokenization and Profanity Filtering

Before we can analyze the sampled data, we need to process it to remove anomalies. The data processing includes:

  • converting all characters to lowercase
  • removing punctuation [!"#$%&'()+,-./:;<=>?@[]^_{|}~]
  • removing numbers
  • removing extra whitespace
  • removing profanity
  • removing stop words

To filter strong language from the text, we use a list of bad words (profanity). For the text mining itself, the tm and RWeka packages are used in this study.

library(openNLP)
library(tm) 
library(qdap)
## Loading required package: ggplot2
## Loading required package: qdapDictionaries
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## 
## The following object is masked from 'package:base':
## 
##     Filter
library(RWeka)

# DataBloga <- unlist(DataBloga)
# DataNewsa <- unlist(DataNewsa)
# DataTwa <- unlist(DataTwa)
OneDoc <- c(DataBloga, DataNewsa, DataTwa)  # combine the three samples into one character vector
OneDoc <- sent_detect(OneDoc, language = "en", model = NULL) # splitting of text paragraphs into sentences.

corpus <- VCorpus(VectorSource(OneDoc)) # main corpus with all sample files
corpus <- tm_map(corpus, removeNumbers) 
corpus <- tm_map(corpus, stripWhitespace) 
corpus <- tm_map(corpus, tolower) 
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
#corpus <- tm_map(corpus, PlainTextDocument)
# Remove profanity words
setwd("D:/2014_Coursera/DS-10_Capstone Project/en_US")
conprofane <- file("./profanity.txt", "r")
profanity_vector <- VectorSource(readLines(conprofane))
## Warning: incomplete final line found on './profanity.txt'
corpus <- tm_map(corpus, removeWords, profanity_vector) 
corpus <- gsub("http\\w+","", corpus)

# Converting Corpus to Data Frame for processing by the RWeka functions
# cleantext<-data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F) 
# head(cleantext, n=6)

# Using the RWeka package for the 1-gram(single word) tokenization, 2-grams sets 
# and 3-grams sets for further exploratory analysis.
Onegram <- NGramTokenizer(corpus, Weka_control(min = 1, max = 1,delimiters = " \\r\\n\\t.,;:\"()?!"))
Bigram <- NGramTokenizer(corpus, Weka_control(min = 2, max = 2,delimiters = " \\r\\n\\t.,;:\"()?!"))
Trigram <- NGramTokenizer(corpus, Weka_control(min = 3, max = 3,delimiters = " \\r\\n\\t.,;:\"()?!"))

Exploratory Analysis Results

After processing and tokenizing the sampled data, we can transform the n-gram tokens into data frames and count the frequency of the words/phrases for the exploratory plots. For each n-gram size, the top 35 most frequent words/phrases are selected.

# converting tokens of n-grams into tables
Tab_onegram <- data.frame(table(Onegram))
Tab_bigram <- data.frame(table(Bigram))
Tab_trigram <- data.frame(table(Trigram))
#head(Tab_trigram, n=6)

# sorting the word distribution frequency  
OnegramGrp <- Tab_onegram[order(Tab_onegram$Freq,decreasing = TRUE),]
BigramGrp <- Tab_bigram[order(Tab_bigram$Freq,decreasing = TRUE),]
TrigramGrp <- Tab_trigram[order(Tab_trigram$Freq,decreasing = TRUE),]

# Top 35 most frequent 1-, 2- and 3-grams from the combined sample
OneSamp <- OnegramGrp[1:35,]
colnames(OneSamp) <- c("Word","Frequency")
BiSamp <- BigramGrp[1:35,]
colnames(BiSamp) <- c("Word","Frequency")
TriSamp <- TrigramGrp[1:35,]
colnames(TriSamp) <- c("Word","Frequency")

Plot Example - Most Frequent 1-grams

With the word frequency counts, we can plot charts showing the distribution of word frequencies. The bar chart below shows the frequency counts of the top 35 1-grams (single words).

ggplot(OneSamp, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.20) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure: Bar chart of the top 35 most frequent 1-grams

Plot Example - Most Frequent 2-grams

ggplot(BiSamp, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.20) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure: Bar chart of the top 35 most frequent 2-grams

Plot Example - Most Frequent 3-grams

ggplot(TriSamp, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.20) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure: Bar chart of the top 35 most frequent 3-grams

Conclusions

From the exploratory analysis of the news, blogs, and Twitter texts, we have obtained the following findings:

  • the sizes (line counts) of the three files
  • the most frequent words/phrases (1- to 3-grams) in the sampled text

Next Steps

For the final analysis, text modelling, and text prediction, the following work remains (a rough sketch of the prediction step is given after the list):

  • N-Gram modelling of the full text data sets
  • Optimize model for low memory utilization
  • Implement model as a Shiny App
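
As a rough illustration of how the n-gram tables built above could drive the prediction step, the sketch below looks up the last two words of the input phrase in the sorted trigram table and backs off to the bigram table when there is no match. The predictNextWord() function is illustrative only; the final model will need smoothing and a more memory-efficient data structure.

# Simple backoff sketch using the TrigramGrp/BigramGrp frequency tables built above
predictNextWord <- function(phrase, trigrams = TrigramGrp, bigrams = BigramGrp) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    # try the trigram table first: match "w1 w2 " at the start of each trigram
    prefix <- paste(words[n - 1], words[n])
    hits <- trigrams[grepl(paste0("^", prefix, " "), trigrams$Trigram), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$Trigram[1])  # tables are already sorted by frequency
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # back off to the bigram table using the last word only
  prefix <- words[n]
  hits <- bigrams[grepl(paste0("^", prefix, " "), bigrams$Bigram), ]
  if (nrow(hits) > 0) {
    best <- as.character(hits$Bigram[1])
    return(tail(strsplit(best, " ")[[1]], 1))
  }
  NA_character_  # no match found
}
# Example call: predictNextWord("thanks for")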