In this Data Science Capstone we apply data modeling and prediction to natural language processing. The project is done in association with SwiftKey, a company that develops smart prediction technology for easier mobile typing. A first step towards the project is to familiarize ourselves with Natural Language Processing, Text Mining, and the associated tools in R. The goal of this milestone report is to demonstrate that we are familiar with the data and on track to create a prediction algorithm.
The data come from a corpus called HC Corpora (www.corpora.heliohost.org). The dataset can be downloaded from the location below; once unzipped, it provides a folder “en_US/” containing the English text that we use in our analysis:
destination_file <- "Coursera-SwiftKey.zip"
source_file <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# execute the download
download.file(source_file, destination_file)
# extract the files from the zip file
unzip(destination_file)
Let’s first look at the data to understand what it looks like and how much effort is needed to clean it.
The line count for each file (wc -l *.txt) is:
899288 en_US.blogs.txt
1010242 en_US.news.txt
2360148 en_US.twitter.txt
4269678 total
The word count for each file (wc -w *.txt) is:
37334690 en_US.blogs.txt
34372720 en_US.news.txt
30374206 en_US.twitter.txt
102081616 total
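For reference, the same counts can be reproduced from within R. The sketch below is only a rough equivalent of the shell commands above (word counts may differ slightly from wc on lines with unusual whitespace, and reading the full files is memory-intensive):
# Rough R equivalent of wc -l and wc -w for a single file
count_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(vapply(strsplit(lines, "\\s+"),
                      function(w) sum(nzchar(w)), integer(1)))
  c(lines = length(lines), words = words)
}
count_file("Project/final/en_US/en_US.blogs.txt")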
File sizes (in megabytes, MB):
file.info("Project/final/en_US/en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("Project/final/en_US/en_US.news.txt")$size / 1024^2
## [1] 196.2775
file.info("Project/final/en_US/en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
Given the file sizes above, caching the data will speed up long computations and plots that are expensive to generate in knitr.
library(knitr)
opts_chunk$set(cache=TRUE, cache.path = 'DocumentName_cache/', fig.path='figure/')
The corpora are contained in three separate plain-text files, which we import as follows. Note that for the final capstone project the data selection will be randomized; a sketch of such random sampling follows the import code below.
iblogs <- file("Project/final/en_US/en_US.blogs.txt")
blogs <- readLines(iblogs, n = 5000, encoding = "UTF-8")
close(iblogs)
inews <- file("Project/final/en_US/en_US.news.txt")
news <- readLines(inews, n = 5000, encoding = "UTF-8")
close(inews)
itwitter <- file("Project/final/en_US/en_US.twitter.txt")
twitters <- readLines(itwitter, n = 5000, encoding = "UTF-8")
close(itwitter)
alltext <- paste(blogs, news, twitters) # combine the samples from all 3 sources into one vector of 5,000 lines (element-wise paste)
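A sketch of that randomized selection, keeping the 5,000-line sample size used above (the seed is an arbitrary choice, and reading an entire file this way requires sufficient memory):
set.seed(1234) # arbitrary seed, for reproducibility only
allblogs <- readLines("Project/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
blogs_sample <- sample(allblogs, 5000) # a random 5,000 lines instead of the first 5,000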
In the next section, basic data tables are created from this text, including frequency-sorted 1-grams, 2-grams, 3-grams, and 4-grams.
We first create a corpus and identify appropriate tokens such as words, punctuation, and numbers. A corpus is a collection of texts, usually stored electronically, on which we perform our analysis. The tm package is described at http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf:
library(NLP)
library(tm)
txt <- VectorSource(alltext)
corpus <- VCorpus(txt)
#inspect(corpus)
We modify the text by removing numbers, removing punctuation, converting everything to lower case, and stripping extra whitespace; removal of common English stop words is deliberately commented out because we want to be able to predict them. The tm_map function allows us to apply transformation functions to a corpus.
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
#corpus <- tm_map(corpus, removeWords, stopwords("english")) #Note that stopwords were intentionally kept so that we can predict them
corpus <- tm_map(corpus, stripWhitespace)
Next we perform stemming, which truncates words to their stems (e.g. “truncate”, “truncates”, and “truncating” all become “truncat”), using the SnowballC package.
# stem words do not need to be removed, per TA's advice, more valuable for sentiment prediction
require(SnowballC)
corpus<- tm_map(corpus, stemDocument)
detach("package:SnowballC")
We also remove profanity and other words we do not want to predict.
badwords <- readLines("Project/badwords.txt") # removeWords expects a plain character vector of words
corpus <- tm_map(corpus, removeWords, badwords)
The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the data and to prepare to build our first linguistic models.
We perform an exploratory analysis of the data, examining the distribution of words and the relationships between words in the corpora.
# Flatten the cleaned corpus back into a plain-text data frame for tokenization
corpus <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)
library(RWeka) #For n-gram vector generation
# Generate 1-gram, 2-gram, 3-gram and 4-gram token vectors
onetoken <- NGramTokenizer(corpus, Weka_control(min = 1, max = 1))
bitoken <- NGramTokenizer(corpus, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritoken <- NGramTokenizer(corpus, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
quadtoken <- NGramTokenizer(corpus, Weka_control(min = 4, max = 4, delimiters = " \\r\\n\\t.,;:\"()?!"))
We transform the n-gram vectors into data frames and order them by frequency for charting.
one <- data.frame(table(onetoken))
two <- data.frame(table(bitoken))
tri <- data.frame(table(tritoken))
quad <- data.frame(table(quadtoken))
onesorted <- one[order(one$Freq,decreasing = TRUE),]
twosorted <- two[order(two$Freq,decreasing = TRUE),]
trisorted <- tri[order(tri$Freq,decreasing = TRUE),]
quadsorted <- quad[order(quad$Freq,decreasing = TRUE),]
# single-word combinations
one20 <- onesorted[1:20,]
colnames(one20) <- c("Word","Frequency")
# bi-word combinations
two20 <- twosorted[1:20,]
colnames(two20) <- c("Word","Frequency")
# tri-word combinations
tri20 <- trisorted[1:20,]
colnames(tri20) <- c("Word","Frequency")
# quad-word combinations
quad20 <- quadsorted[1:20,]
colnames(quad20) <- c("Word","Frequency")
Here are the highest-frequency n-gram combinations for each size:
head(onesorted)
## onetoken Freq
## 31755 the 21838
## 32227 to 11878
## 1181 and 11259
## 35 a 10557
## 21995 of 9279
## 15590 in 7248
head(twosorted)
## bitoken Freq
## 140551 of the 2006
## 100406 in the 1897
## 211327 to the 981
## 143132 on the 851
## 75628 for the 840
## 209147 to be 750
head(trisorted)
## tritoken Freq
## 230595 one of the 154
## 4326 a lot of 150
## 125239 going to be 77
## 333483 to be a 77
## 170670 it was a 69
## 151112 i want to 65
head(quadsorted)
## quadtoken Freq
## 335192 the end of the 34
## 122948 for the first time 32
## 345048 the rest of the 28
## 44881 at the end of 27
## 402293 when it comes to 27
## 173913 in the middle of 24
To understand the frequency characteristics of the data, we look at how many words and word pairs occur only once in the sample:
oneunique <- onesorted[onesorted$Freq == 1,]
nrow(oneunique) # number of words that occur only once
## [1] 19247
sum(onesorted$Freq) # total number of word tokens in the text
## [1] 430667
twounique <- twosorted[twosorted$Freq == 1,]
nrow(twounique) # number of bi-grams that occur only once
## [1] 195502
sum(twosorted$Freq) # total number of bi-grams in the text
## [1] 430666
An interesting observation is that the total number of word tokens is nearly the same as the total number of bi-grams. This is expected: a text containing N word tokens yields N - 1 overlapping bi-grams, so the two totals differ by exactly one here.
Top 20 single words (sorted alphabetically):
library(ggplot2)
ggplot(one20, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="blue") + geom_text(aes(label=Frequency), vjust=-0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Top 20 bi-grams (sorted alphabetically):
ggplot(two20, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="blue") + geom_text(aes(label=Frequency), vjust=-0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Top 20 tri-grams (sorted alphabetically):
ggplot(tri20, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="blue") + geom_text(aes(label=Frequency), vjust=-0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Top 20 quad-grams (sorted alphabetically):
ggplot(quad20, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="blue") + geom_text(aes(label=Frequency), vjust=-0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
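As an aside, the four chart chunks above differ only in their input data frame; a small helper function (plot_top20 is our own name, not part of any package) would keep them consistent:
# Helper: bar chart of a top-20 n-gram data frame with columns Word and
# Frequency, as built above (ggplot2 is already loaded).
plot_top20 <- function(df) {
  ggplot(df, aes(x = Word, y = Frequency)) +
    geom_bar(stat = "identity", fill = "blue") +
    geom_text(aes(label = Frequency), vjust = -0.2) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
plot_top20(quad20) # reproduces the quad-gram chart above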
The goal now is to build our first simple model of the relationships between words; this is the first step in building a predictive text-mining application.
When comparing the highest-frequency results, we did not find 4-grams helpful for predicting the next word (the most frequent 4-gram in our sample appears only 34 times).
A major trade-off is the amount of data analyzed (corpus size) versus analysis time. Note that stop words were intentionally kept so that we can predict them, even though there are a great many of them.
Adding more lines of text from the target corpus does not always improve model accuracy. The model will therefore be built on qualitative n-gram criteria rather than purely quantitative ones. A minimal sketch of the kind of lookup we have in mind follows.
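The sketch below is built directly on the onesorted, twosorted and trisorted tables created above and is deliberately naive (no smoothing, backoff weights, or regex escaping of the input); it is not the final algorithm:
# Naive backoff lookup: try tri-grams keyed on the last two words, fall
# back to bi-grams keyed on the last word, then to the most frequent
# single word. The tables are already sorted by decreasing frequency.
predict_next <- function(phrase) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trisorted[grepl(paste0("^", prefix, " "), trisorted$tritoken), ]
    if (nrow(hits) > 0)
      return(sub(paste0("^", prefix, " "), "", as.character(hits$tritoken[1])))
  }
  hits <- twosorted[grepl(paste0("^", words[n], " "), twosorted$bitoken), ]
  if (nrow(hits) > 0)
    return(sub(paste0("^", words[n], " "), "", as.character(hits$bitoken[1])))
  as.character(onesorted$onetoken[1]) # most frequent word overall
}
predict_next("one of") # returns "the", based on the counts above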
The Twitter text includes many non-printable characters and is therefore difficult to clean and use; a possible cleaning step is sketched below.
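A sketch of such a cleaning step, under the assumption that simply dropping non-ASCII and non-printable characters is acceptable:
# Strip non-ASCII and non-printable characters from the raw Twitter
# lines before they enter the corpus.
twitters_clean <- iconv(twitters, from = "UTF-8", to = "ASCII", sub = "")
twitters_clean <- gsub("[^[:print:]]", " ", twitters_clean)
twitters_clean <- gsub("\\s+", " ", twitters_clean) # collapse leftover whitespace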
HC Corpora (www.corpora.heliohost.org)
Johns Hopkins Data Science Capstone, https://www.coursera.org/course/dsscapstone