This milestone report is part of the Capstone Project in the Data Science Specialization offered by Johns Hopkins University on Coursera.org. Its main objective is to develop an understanding of the statistical properties of the dataset that can be applied to Natural Language Processing (NLP) in order to build a predictive text application. As the user types, the predictive model will recommend the next possible word(s) to append to the input. The final product will be deployed as a Shiny application, which lets users type text and see next-word suggestions in a web-based environment.
The text data can be downloaded here and is provided in four different languages. We will be using the English corpora. The model will be trained using a document corpus compiled from the following three sources of text data: blogs, Twitter, and news.
library(knitr)
library(R.utils)
library(stringi)
library(quanteda)
library(ggplot2)
library(tidytext)
library(ngram)
library(plotly)
library(data.table)
Download, unzip, and load the data
rm(list=ls())
if(!file.exists("~/Jordan/Desktop")){
dir.create("~/Jordan/Desktop")
}
if(!file.exists("~/Jordan/Desktop/final")){
unzip(zipfile="~/Jordan/Desktop/Coursera-SwiftKey.zip",exdir="~/Jordan/Desktop")
}
## Warning in unzip(zipfile = "~/Jordan/Desktop/Coursera-SwiftKey.zip", exdir
## = "~/Jordan/Desktop"): error 1 in extracting from zip file
For convenience, the file path of each of the three sources is stored in a variable, and each file's size is obtained and converted to megabytes.
blogs_path <- "C:/Users/Jordan/Desktop/final/en_US/en_US.blogs.txt"
twitter_path <- "C:/Users/Jordan/Desktop/final/en_US/en_US.twitter.txt"
news_path <- "C:/Users/Jordan/Desktop/final/en_US/en_US.news.txt"
sb <- file.info(blogs_path)$size/1024^2
st <- file.info(twitter_path)$size/1024^2
sn <- file.info(news_path)$size/1024^2
Lines are read from each source, and the numbers of lines and words are counted.
#read lines
blogs<-readLines(blogs_path,warn=FALSE,encoding="UTF-8")
twitter<-readLines(twitter_path,warn=FALSE,encoding="UTF-8")
news<-readLines(news_path,warn=FALSE,encoding="UTF-8")
# count words per line
nwlblogs <- stri_count_words(blogs)
nwltwitter <- stri_count_words(twitter)
nwlnews<- stri_count_words(news)
#count number of words
nwblogs <- wordcount(blogs, sep = " ")
nwtwitter <- wordcount(twitter, sep = " ")
nwnews <- wordcount(news, sep = " ")
#count number of lines
nlblogs <- countLines(blogs_path)
nltwitter <- countLines(twitter_path)
nlnews <- countLines(news_path)
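As a small illustration of the counts above, the sketch below computes words per line and a total word count on a toy character vector using only base R. It splits on whitespace, which is an assumption that only approximates `stri_count_words` and `wordcount`; the `toy_lines` vector and the `count_words_per_line` helper are hypothetical names, not part of the report's code.

```r
# Hypothetical toy input standing in for lines returned by readLines()
toy_lines <- c("the quick brown fox", "hello world", "")

# Count whitespace-separated words on each line. This approximates
# stri_count_words(), which uses ICU word-boundary rules instead.
count_words_per_line <- function(lines) {
  pieces <- strsplit(trimws(lines), "\\s+")
  # Count only non-empty pieces so that empty lines yield 0 words
  vapply(pieces, function(p) sum(nzchar(p)), integer(1))
}

words_per_line <- count_words_per_line(toy_lines)
total_words <- sum(words_per_line)      # total words in the toy input
mean_words <- mean(words_per_line)      # analogous to AWP
median_words <- median(words_per_line)  # analogous to MWP
```

Because the two tokenizers differ, totals computed this way will not match the report's figures exactly; the sketch only shows how per-line counts relate to the totals, mean, and median.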
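The blogs and Twitter sources each contain more than 30 million words, which is a great asset for building a prediction algorithm. The news word count reported below is only about 2.6 million, far lower than its line count and mean words per line would suggest, so the news file may not have been read in full. The variables AWP and MWP are the mean and median number of words per line, respectively.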
data <- data.table(
Items = c("Blogs", "Twitter", "News"),
FileName=c("en_US.blogs.txt","en_US.twitter.txt", "en_US.news.txt"),
Size_MB = c(sb, st, sn),
Words = c(nwblogs, nwtwitter, nwnews),
Lines = c(nlblogs, nltwitter, nlnews),
AWP = c(mean(nwlblogs), mean(nwltwitter), mean(nwlnews)),
MWP = c(median(nwlblogs), median(nwltwitter), median(nwlnews))
)
data
     Items          FileName  Size_MB    Words   Lines      AWP MWP
1:   Blogs   en_US.blogs.txt 200.4242 37334131  899288 41.75107  28
2: Twitter en_US.twitter.txt 159.3641 30373543 2360148 12.75063  12
3:    News    en_US.news.txt 196.2775  2643969 1010242 34.61779  32
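One way to sanity-check a summary table like this is to compare the reported word count with the count implied by lines multiplied by mean words per line. The sketch below does that with the values copied from the table; the 10% tolerance and the column names `Implied`, `Ratio`, and `Consistent` are assumptions chosen for illustration, and the two word counts come from different tokenizers, so small differences are expected.

```r
# Values copied from the summary table in this report
summary_tbl <- data.frame(
  Items = c("Blogs", "Twitter", "News"),
  Words = c(37334131, 30373543, 2643969),
  Lines = c(899288, 2360148, 1010242),
  AWP   = c(41.75107, 12.75063, 34.61779)
)

# Word count implied if every counted line had the mean number of words
summary_tbl$Implied <- summary_tbl$Lines * summary_tbl$AWP

# Ratio of reported to implied words; values near 1 indicate agreement
summary_tbl$Ratio <- summary_tbl$Words / summary_tbl$Implied

# Flag sources whose counts agree within an assumed 10% tolerance
summary_tbl$Consistent <- abs(summary_tbl$Ratio - 1) <= 0.10

print(summary_tbl[, c("Items", "Ratio", "Consistent")])
```

With these numbers the blogs and Twitter rows pass the check while the news row does not, which suggests the news file deserves a closer look before its counts are relied on.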
The ratio of words per line is plotted for each source. According to the plot, blogs have the highest ratio of words per line.
ratio = data.frame(ratio=c(nwblogs/nlblogs, nwtwitter/nltwitter, nwnews/nlnews), media=as.factor(c("Blogs", "Twitter", "News")))
ggplot(data = ratio, aes(x=media, y= ratio, fill =media)) +
geom_bar(stat="identity") +
labs(title="Ratio of Words/Line in different media sources", x="Media Source",y="Ratio of Words/Line")
Each of the three sources is sampled at 10% for training. This 10% is an arbitrary figure that can be adjusted as the prediction model requires for accuracy. All non-ASCII characters are removed from the sampled data, and the three samples are then combined into a single corpus sample.
set.seed(165)
#selecting ten percent of data as training data
samplesize <- 0.1
#Sample for each source
blogsS <- blogs[sample(1:length(blogs), length(blogs) * samplesize)]
twitterS<- twitter[sample(1:length(twitter), length(twitter) * samplesize)]
newsS <- news[sample(1:length(news), length(news) * samplesize)]
# remove all non-ASCII characters from the sampled data
blogsS <- iconv(blogsS, "latin1", "ASCII", sub = "")
newsS <- iconv(newsS , "latin1", "ASCII", sub = "")
twitterS <- iconv(twitterS, "latin1", "ASCII", sub = "")
Sample <- c(blogsS, twitterS, newsS)
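The sampling step can be wrapped in a small helper so that the same fraction and seed are applied to every source. This is a minimal sketch on a toy vector; `sample_lines` and `toy_corpus` are hypothetical names introduced here, and the helper resets the global random seed as a side effect, which is an assumption made to keep the example reproducible.

```r
# Draw a reproducible random fraction of lines from a character vector.
# Note: calling set.seed() here changes the global RNG state.
sample_lines <- function(lines, fraction, seed) {
  set.seed(seed)
  n_keep <- floor(length(lines) * fraction)
  lines[sample.int(length(lines), n_keep)]
}

# Hypothetical toy corpus of 100 lines
toy_corpus <- paste("line", 1:100)

# Two draws with the same seed return the same 10 lines
s1 <- sample_lines(toy_corpus, 0.1, seed = 165)
s2 <- sample_lines(toy_corpus, 0.1, seed = 165)
```

Changing `fraction` is then a one-argument change if the 10% sample turns out to be too small or too large for the model.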
Obscene language should be removed. A list of profane words can be obtained from the School of Computer Science at Carnegie Mellon University; it contains more than 1,300 offensive words.
if(!file.exists("./swearWords.txt")){
download.file(url = "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt",
destfile= "./swearWords.txt",
method = "curl")
}
profanity <- readLines("./swearWords.txt",warn=FALSE, encoding = "UTF-8")
The following transformations are applied to the sample dataset:
* Remove numbers
* Remove punctuation marks
* Remove URLs
* Remove separators
* Remove symbols
* Remove Twitter handles
* Apply the stopword list
* Convert to lower case
* Stem words
# Build tokens using "quanteda" package
t <- tokens(Sample,
what="word",
remove_numbers = TRUE,
remove_punct = TRUE,
remove_url =TRUE,
remove_separators = TRUE,
remove_symbols = TRUE,
remove_twitter = TRUE,
verbose = TRUE)
# Match stopwords; each one is replaced with itself, so stopwords stay in the token stream
t <- tokens_replace(t,pattern =stopwords("english"),replacement=stopwords("english"))
#Set lower case for every word
t <- tokens_tolower(t)
#Apply stemmer to words
t <- tokens_wordstem(t, language = "english")
t.1gram <- tokens_ngrams(t, n = 1, concatenator = " ")
t.2gram <- tokens_ngrams(t, n = 2, concatenator = " ")
t.3gram <- tokens_ngrams(t, n = 3, concatenator = " ")
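To show what the n-gram step produces, here is a minimal base R sketch that builds n-grams from a single vector of tokens. It is not how `tokens_ngrams` is implemented; `make_ngrams` and `toy_tokens` are hypothetical names used only for illustration.

```r
# Build all contiguous n-grams from a character vector of tokens,
# joining the words of each n-gram with `sep`.
make_ngrams <- function(tokens, n, sep = " ") {
  # Fewer than n tokens means no n-gram can be formed
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = sep),
    character(1)
  )
}

# Hypothetical toy token sequence
toy_tokens <- c("thank", "you", "for", "the", "follow")

toy_bigrams <- make_ngrams(toy_tokens, 2)
toy_trigrams <- make_ngrams(toy_tokens, 3)
```

A sequence of k tokens yields k - n + 1 n-grams, so the five toy tokens give four bigrams and three trigrams.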
The predictive model for the Shiny application will handle unigrams, bigrams, and trigrams. The quanteda package is used to construct document-feature matrices of unigrams, bigrams, and trigrams from the tokenized data.
# Build document-feature matrices, then drop profane features with dfm_remove()
unigram <- dfm_remove(dfm(t.1gram, verbose = FALSE), pattern = profanity)
bigram <- dfm_remove(dfm(t.2gram, verbose = FALSE), pattern = profanity)
trigram <- dfm_remove(dfm(t.3gram, verbose = FALSE), pattern = profanity)
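Conceptually, the frequency information used in the plots that follow is a count of how often each n-gram occurs, with blocked terms dropped. The base R sketch below illustrates that idea on a toy vector; it is not the quanteda implementation, and `toy_ngrams`, `blocked`, and the placeholder term in it are hypothetical.

```r
# Hypothetical toy n-grams standing in for tokenized data
toy_ngrams <- c("of the", "in the", "of the", "to be",
                "of the", "in the", "blockedterm")

# Hypothetical block list standing in for the profanity list
blocked <- c("blockedterm")

# Drop blocked n-grams, then count occurrences of each remaining n-gram
kept <- toy_ngrams[!toy_ngrams %in% blocked]
ngram_counts <- sort(table(kept), decreasing = TRUE)

# Keep the two most frequent n-grams, similar in spirit to topfeatures(x, 2)
top2 <- head(ngram_counts, 2)
```

Note that this filter only removes n-grams that match a blocked term exactly; a bigram or trigram that merely contains a blocked word would pass through unchanged.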
Plot of the ten most frequent unigrams.
topUniVector <- topfeatures(unigram, 10)
topUniVector <- sort(topUniVector, decreasing = FALSE)
topUniDf <- data.frame(words = names(topUniVector), freq = topUniVector)
topUniPlot <- ggplot(data = topUniDf, aes(x = reorder(words, freq), y = freq, fill = freq)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(x = "Unigram", y = "Frequency") +
labs(title = "Ten most common Unigrams") +
theme(plot.title = element_text(hjust = 0.5))+
coord_flip() +
guides(fill = "none")
g1<- ggplotly(topUniPlot)
g1
Plot of the ten most frequent bigrams.
topBiVector <- topfeatures(bigram, 10)
topBiVector <- sort(topBiVector, decreasing = FALSE)
topBiDf <- data.frame(words = names(topBiVector), freq = topBiVector)
topBiPlot <- ggplot(data = topBiDf, aes(x = reorder(words, freq), y = freq, fill = freq)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(x = "Bigram", y = "Frequency") +
labs(title = "Ten most common Bigrams") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_flip() +
guides(fill = "none")
g2<- ggplotly(topBiPlot)
g2
Plot of the ten most frequent trigrams.
topTriVector <- topfeatures(trigram, 10)
topTriVector <- sort(topTriVector, decreasing = FALSE)
topTriDf <- data.frame(words = names(topTriVector), freq = topTriVector)
topTriPlot <- ggplot(data = topTriDf, aes(x = reorder(words, freq), y = freq, fill = freq)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(x = "Trigram", y = "Frequency") +
labs(title = "Ten most common Trigrams") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_flip() +
guides(fill = "none")
g3<- ggplotly(topTriPlot)
g3
The final part of this capstone project is to build a predictive algorithm that will be deployed as a Shiny app, providing the user interface in a web browser. The Shiny app will take one or more words as input in a text box and output a prediction of the next word.
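To make the idea of next-word prediction concrete, here is a minimal sketch, assuming a simple lookup over bigram frequencies. It is one possible approach and not the final algorithm; `bigram_counts`, its values, and `predict_next` are hypothetical and chosen only for illustration.

```r
# Hypothetical bigram frequencies, named by the bigram text
bigram_counts <- c("of the" = 5, "of a" = 2, "in the" = 4, "the end" = 1)

# Return up to k candidate next words for `prev_word`, most frequent first.
# Returns character(0) when the word was never seen as a bigram prefix.
predict_next <- function(prev_word, counts, k = 1) {
  parts <- strsplit(names(counts), " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)   # first word of each bigram
  second <- vapply(parts, `[`, character(1), 2)  # second word of each bigram
  matches <- first == tolower(prev_word)
  if (!any(matches)) return(character(0))
  hits <- counts[matches]
  candidates <- second[matches]
  ranked <- candidates[order(hits, decreasing = TRUE)]
  ranked[seq_len(min(k, length(ranked)))]
}

best <- predict_next("of", bigram_counts)
top_two <- predict_next("Of", bigram_counts, k = 2)
unseen <- predict_next("zebra", bigram_counts)
```

A fuller model would typically try the longest available context first (trigram, then bigram, then unigram) and fall back when no match is found; the sketch covers only the bigram case.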
The predictive algorithm will be built as an n-gram model with a frequency lookup similar to that performed in the exploratory data analysis section of this report. Based on the exploratory analysis, here are the plans: