Overview

The goal of this project is to build a model that predicts the next word given an input word or sentence fragment. This report examines the three sets of writing samples and performs some exploratory analysis on them. 1-gram (one word at a time) to 3-gram (three-word phrase) models are briefly examined on samples of the datasets. As a next step, a 1-gram to n-gram model built from all of the text datasets will be used to predict the next word when a phrase is entered.
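
To make the idea concrete, the sketch below (with made-up counts and a hypothetical predict_next helper, not the final model) shows how a next-word prediction can be read off an n-gram frequency table: look up the observed continuations of a phrase and return the most frequent one.

# Illustrative only: a tiny trigram table with hypothetical counts.
trigram_counts <- data.frame(
  prefix    = c("at the", "at the", "at the"),
  next_word = c("end", "same", "moment"),
  freq      = c(120, 85, 30),
  stringsAsFactors = FALSE
)

# Return the most frequent continuation of a phrase, or NA if the phrase is unseen.
predict_next <- function(phrase, tab) {
  hits <- tab[tab$prefix == phrase, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$next_word[which.max(hits$freq)]
}

predict_next("at the", trigram_counts)  # "end"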

About the data

Download link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking, and a whole range of other activities. To facilitate typing on mobile devices, SwiftKey, our corporate partner in this capstone project, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models.

In this capstone project we apply natural language processing (NLP), text mining, and R tools for exploratory data analysis as well as for the subsequent text modelling and prediction. In this report, we focus on the files that contain English text.

Getting data

Downloading & unzipping

## url  <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## download.file(url, destfile="coursera-swiftkey.zip")
## theFileList=c("final/en_US/en_US.twitter.txt", "final/en_US/en_US.news.txt","final/en_US/en_US.blogs.txt") 
## unzip("coursera-swiftkey.zip", files = theFileList, exdir="en_US", overwrite=TRUE, junkpaths=TRUE)

Importing data & summarizing

After downloading and unzipping the data, we would like to do a basic analysis of the raw files to get an idea of the size of the texts. So we use the readLines() and scan() functions to read in the data, and countLines() from the R.utils package to count the lines.

library(R.utils)
library(stringr)
setwd("tm")

# Number of lines for blog data
Blogs <- "./en_US.blogs.txt"
BlogData <- scan(Blogs, character(0), sep = "\n")  # one element per line
line_blogs <- as.numeric(countLines(Blogs))
line_blogs
## [1] 899288

# Number of lines for news data
News <- "./en_US.news.txt"
NewsData <- scan(News, character(0), sep = "\n")  # one element per line
Line_news <- as.numeric(countLines(News))
Line_news
## [1] 1010242

# Number of lines for twitter data
Tw <- "./en_US.twitter.txt"
TWData <- scan(Tw, character(0), sep = "\n")  # one element per line
Line_tw <- as.numeric(countLines(Tw))
Line_tw
## [1] 2360148
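
In addition to line counts, a rough word count and the file sizes give a fuller picture of how large the raw texts are. The sketch below reuses the objects created above; the word counts are approximate since tokens are split on whitespace only.

# Approximate size summary for the three raw files (uses the objects from above).
sizes_mb <- round(file.size(c(Blogs, News, Tw)) / 1024^2, 1)
word_counts <- c(
  blogs   = sum(lengths(strsplit(BlogData, "\\s+"))),
  news    = sum(lengths(strsplit(NewsData, "\\s+"))),
  twitter = sum(lengths(strsplit(TWData, "\\s+")))
)
data.frame(file    = c("blogs", "news", "twitter"),
           size_MB = sizes_mb,
           lines   = c(line_blogs, Line_news, Line_tw),
           words   = word_counts)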

Selecting a sample

From the size evaluation of the three files above, we can see that these files are fairly large. To build a model for the exploratory analysis, we will take a small sample (roughly 0.1% of the lines) from each of these files, given the limited computer memory and the time required to process the full data. This should still give a good estimate of the most frequently used words in each file.

setwd("tm")
conblog <- file("./en_US.blogs.txt", "r") 
DataBloga <-readLines(conblog, (line_blogs/1000),encoding="latin1")
writeLines(DataBloga, con="databloga.txt", "\n")
close(conblog)

conbnews <- file("./en_US.news.txt", "r") 
DataNewsa<-readLines(conbnews, (Line_news/1000),encoding="latin1")
writeLines(DataNewsa, con="datanewsa.txt", "\n")
close(conbnews)

conbtw<-file("./en_US.twitter.txt", "r")
DataTwa<-readLines(conbtw, (Line_tw/1000),encoding="latin1")
writeLines(DataTwa, con="datatwa.txt", "\n")
close(conbtw)
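
Note that the code above simply keeps the first ~0.1% of lines of each file. If a truly random sample is preferred, a variant along these lines could be used instead (shown for the blog file only; it reads the whole file into memory first, so it is only practical for files of this size).

# Variant (not used below): draw a random 0.1% sample of lines from the blog file.
set.seed(1234)
all_blog_lines <- readLines("./en_US.blogs.txt", encoding = "latin1")
keep <- sample(length(all_blog_lines), ceiling(length(all_blog_lines) / 1000))
writeLines(all_blog_lines[keep], con = "datablog_random.txt")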

Data cleaning & preprocessing (tokenization & profanity filtering)

Before we can run the analysis on the sampled data, we need to preprocess it to remove anomalies. The processing includes:

- changing all characters to lowercase
- removing punctuation [!"#$%&'()+,-./:;<=>?@[]^_{|}~]
- removing numbers
- removing extra whitespace
- removing profanity
- removing stop words

For cleaning strong language (profane words) in the text, we use a list of bad words (link: http://www.cs.cmu.edu/~biglou/resources/bad-words.txt). For text mining, the tm and RWeka packages are used in this study.
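
The profanity list is assumed to be saved locally as bad-words.txt in the same tm/ working directory used by the other code chunks; it can be fetched once with something like:

# One-time download of the profanity list (URL from the link above).
bad_words_url <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
if (!file.exists("tm/bad-words.txt")) {
  download.file(bad_words_url, destfile = "tm/bad-words.txt")
}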

library(openNLP)
library(tm)
library(qdap)
library(RWeka)

DataBloga <- unlist(DataBloga)
DataNewsa <- unlist(DataNewsa)
DataTwa <- unlist(DataTwa)
OneDoc <- c(DataBloga, DataNewsa, DataTwa)  # combine the three samples into one character vector
OneDoc <- sent_detect(OneDoc, language = "en", model = NULL)  # split text paragraphs into sentences

corpus <- VCorpus(VectorSource(OneDoc))  # main corpus with all sample files
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))  # wrap base functions so the corpus structure is preserved
corpus <- tm_map(corpus, content_transformer(function(x) gsub("http\\S+", "", x)))  # remove URLs
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Remove profanity words
setwd("tm")
conprofane <- file("./bad-words.txt", "r")
profanity_words <- readLines(conprofane)
close(conprofane)
corpus <- tm_map(corpus, removeWords, profanity_words)

# Convert the corpus to a data frame for processing by the RWeka functions
corpus <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)
Onegram <- NGramTokenizer(corpus$text, Weka_control(min = 1, max = 1, delimiters = " \\r\\n\\t.,;:\"()?!"))
Bigram  <- NGramTokenizer(corpus$text, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
Trigram <- NGramTokenizer(corpus$text, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))

Exploratory Analysis Results

After processing and tokenizing the sample data, we can transform the n-gram tokens into data frames and count word frequencies for the exploratory analysis plots. For each n-gram order, the top 35 most frequent words/phrases are selected.

# converting tokens of n-grams into tables
Tab_onegram <- data.frame(table(Onegram))
Tab_bigram <- data.frame(table(Bigram))
Tab_trigram <- data.frame(table(Trigram))
#head(Tab_trigram, n=6)

# sorting the word distribution frequency  
OnegramGrp <- Tab_onegram[order(Tab_onegram$Freq,decreasing = TRUE),]
BigramGrp <- Tab_bigram[order(Tab_bigram$Freq,decreasing = TRUE),]
TrigramGrp <- Tab_trigram[order(Tab_trigram$Freq,decreasing = TRUE),]

# Top 35 most frequent entries from each n-gram table
OneSamp <- OnegramGrp[1:35,]
colnames(OneSamp) <- c("Word","Frequency")
BiSamp <- BigramGrp[1:35,]
colnames(BiSamp) <- c("Word","Frequency")
TriSamp <- TrigramGrp[1:35,]
colnames(TriSamp) <- c("Word","Frequency")

Plot Examples - Most Frequent 1-grams, 2-grams & 3-grams

With the word counts and their frequencies, we can plot charts showing the distribution of word frequencies. The bar charts below show the frequency counts for the top 35 1-grams, 2-grams, and 3-grams.

library(ggplot2)
# 1-grams
ggplot(OneSamp, aes(x = reorder(Word, -Frequency), y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.20) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Word")

# 2-grams
ggplot(BiSamp, aes(x = reorder(Word, -Frequency), y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.20) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Word")

# 3-grams
ggplot(TriSamp, aes(x = reorder(Word, -Frequency), y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.20) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Word")

Conclusions

From the exploratory analysis of the blogs, news, and twitter texts, we have obtained the following findings:

- the sizes of the three files (number of lines in each)
- the most frequent words and phrases (1-grams, 2-grams, and 3-grams) in the sampled text

Next Steps

For the final analysis, text modelling, and text prediction, we need to do the following:

- build an n-gram model from the full text datasets (a rough sketch of the intended next-word lookup is shown below)
- optimize the model for low memory utilization
- implement the model as a Shiny app
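
As a rough illustration of where this is heading, the sketch below shows one common way to turn the frequency tables above into a next-word predictor: a simple back-off that tries the trigram table first, then the bigram table, and finally falls back to the most frequent unigram. The function name predict_word and the exact back-off rule are assumptions for illustration; the final model may differ.

# Minimal back-off sketch (illustrative): tables are expected to look like the
# Tab_trigram / Tab_bigram / Tab_onegram data frames built earlier, i.e. the
# n-gram text in the first column and its count in the "Freq" column.
predict_word <- function(phrase, tri_tab, bi_tab, uni_tab) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  last_word_of <- function(x) tail(strsplit(as.character(x), " ")[[1]], 1)

  # 1. Try trigrams whose first two words match the last two words of the phrase.
  if (n >= 2) {
    prefix <- paste(words[(n - 1):n], collapse = " ")
    hits <- tri_tab[grepl(paste0("^", prefix, " "), tri_tab[[1]]), ]
    if (nrow(hits) > 0) return(last_word_of(hits[which.max(hits$Freq), 1]))
  }
  # 2. Back off to bigrams starting with the last word of the phrase.
  if (n >= 1) {
    hits <- bi_tab[grepl(paste0("^", words[n], " "), bi_tab[[1]]), ]
    if (nrow(hits) > 0) return(last_word_of(hits[which.max(hits$Freq), 1]))
  }
  # 3. Fall back to the single most frequent unigram.
  as.character(uni_tab[which.max(uni_tab$Freq), 1])
}

# Example call using the tables created in this report:
# predict_word("one of", Tab_trigram, Tab_bigram, Tab_onegram)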