This Milestone Report is part of the capstone project of the Data Science Capstone course in the Data Science Specialization by Johns Hopkins University on Coursera. The report focuses on the application of data science in the field of natural language processing. The primary objective of this project is to get acquainted with natural language processing, text mining, and the relevant tools available in R. Large databases of text in a target language are commonly used when building language models for various purposes; in this report, the English database is used for the analysis.
The following tasks are performed:
Download, unzip and load the training data.
url<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists('Coursera-capestone-dataset.zip')){
download.file(url,destfile = "Coursera-capestone-dataset.zip")
unzip("Coursera-capestone-dataset.zip")
}
Checking the files in the dataset:
list.files("final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
The data contains four folders, one per language: German (de_DE), English (en_US), Finnish (fi_FI), and Russian (ru_RU). Only the English (en_US) data is considered.
list.files("final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
The en_US folder contains three files:
#File destinations
blogs.file<-"final/en_US/en_US.blogs.txt"
news.file<-"final/en_US/en_US.news.txt"
twitter.file<-"final/en_US/en_US.twitter.txt"
#blogs
blogs<-readLines(blogs.file,encoding="UTF-8",skipNul = TRUE)
# News
news<-readLines(news.file,encoding="UTF-8",skipNul = TRUE, warn=FALSE)
# Twitter
twitter<-readLines(twitter.file,encoding="UTF-8",skipNul = TRUE)
library(stringi)
# File Size
File.size<- paste0(round((file.info(c(blogs.file,news.file,twitter.file))$size/1024^2))," MB")
# Number of lines
Lines<-sapply(list(blogs,news,twitter), length)
# Number of characters
Characters<-sapply(list(nchar(blogs),nchar(news),nchar(twitter)), sum)
# Number of words
Words<- sapply(list(blogs, news, twitter),stri_stats_latex)[4,]
data.summary<-data.frame(File = c("blogs", "news", "twitter"),
File.size,Lines,Characters,Words)
library(knitr)
kable(data.summary)
| File | File.size | Lines | Characters | Words |
|---|---|---|---|---|
| blogs | 200 MB | 899288 | 206824505 | 37570839 |
| news | 196 MB | 77259 | 15639408 | 2651432 |
| twitter | 159 MB | 2360148 | 162096241 | 30451170 |
To perform the exploratory analysis, we will use a sample of the data, since using the full dataset would only increase the computational load. Therefore, we take a 5% sample of the lines from each file.
set.seed(5463)
# Sample size
sample.size=0.05
# Sample Data
Sample.blogs<-sample(blogs,length(blogs)*sample.size,replace =F)
Sample.news<-sample(news,length(news)*sample.size,replace =F)
Sample.twitter<-sample(twitter,length(twitter)*sample.size,replace =F)
#complete data file
sample.Data<-c(Sample.blogs,Sample.news,Sample.twitter)
# remove all non-English characters from the sampled data
sample.Data <- iconv(sample.Data, "latin1", "ASCII", sub = "")
# write the complete data file into a text file
sample.con<-file("sample-data.txt",open='w')
writeLines(sample.Data,sample.con)
close(sample.con)
Before we clean the data, we will create a corpus to organize the text data using the R package ‘quanteda’.
suppressMessages(library(quanteda))
sample.Data.corpus <- corpus(sample.Data)
To clean the data, we will employ the following steps. First, tokenize the text, removing numbers, punctuation, URLs, separators, and symbols:
sample.Data.corpus.tokens<-tokens(sample.Data.corpus,
what="word",
remove_numbers = TRUE,
remove_punct = TRUE,
remove_url =TRUE,
remove_separators = TRUE,
remove_symbols = TRUE,
split_hyphens = FALSE)
Next, remove common stopwords such as “the,” “and,” and “is,” which are not needed for this analysis, using stopwords():
sample.Data.corpus.tokens.no.stopwords<-tokens_remove(sample.Data.corpus.tokens,pattern=stopwords("en"))
Profanity filter: removing profanity and other words that we do not want to predict. To filter profanity from the data, we will use the British swear words list provided by www.freewebheaders.com.
# download bad words file
url.bad.words<-"https://www.freewebheaders.com/download/files/british-swear-words-list_text-file.zip"
if(!file.exists("bad-words.zip")){
download.file(url.bad.words,destfile = "bad-words.zip")
unzip("bad-words.zip")
}
# Read bad words file
bad.words<-readLines("british-swear-words-list_text-file.txt",warn=F)[-c(1:9)]
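The profanity list is applied later, when each document-feature matrix is built, via the remove argument of dfm(). As a rough sanity check it could also be applied directly at the token level; the following is only a sketch using the objects defined above (the object name sample.Data.corpus.tokens.clean is introduced here for illustration and is not used in the rest of the report):
# Optional sketch: drop profane tokens before building any n-grams (not used below)
sample.Data.corpus.tokens.clean<-tokens_remove(sample.Data.corpus.tokens.no.stopwords,
                                               pattern=bad.words)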
In this part, the aim is to understand the distribution of, and relationship between, the words, tokens, and phrases in the text.
Tasks to accomplish:
- Build basic n-gram model: using the exploratory analysis performed here, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
- Build a model to handle unseen n-grams: in some cases people will want to type a combination of words that does not appear in the corpora, so build a model to handle cases where a particular n-gram is not observed.
# Unigram
Unigram<-tokens_ngrams(sample.Data.corpus.tokens.no.stopwords,n=1)
A document-feature matrix is created for the unigrams; all words are converted to lowercase and profanity is removed.
text.unigram<-dfm(Unigram,tolower=TRUE,remove_padding = TRUE,
remove=bad.words,verbose=FALSE)
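Note that in more recent quanteda releases the remove argument of dfm() may be deprecated; if that is the case for your installation, an equivalent approach (a sketch using the same objects, not the code actually run for this report) is to build the matrix first and then drop the profanity with dfm_remove():
# Sketch for newer quanteda versions: build the dfm, then remove profanity
text.unigram<-dfm(Unigram,tolower=TRUE,remove_padding = TRUE,verbose=FALSE)
text.unigram<-dfm_remove(text.unigram,pattern=bad.words)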
Next, we find the most frequently occurring unigrams in the corpus.
# The most frequently occurring words in the unigrams
Unigram.freq<-topfeatures(text.unigram,100)
Unigram.freq.df<-data.frame(Unigram=names(Unigram.freq),freq=Unigram.freq)
Plotting a bar chart for word frequencies.
library(ggplot2)
# Bar plot of the 20 most frequent words
g<-ggplot(Unigram.freq.df[1:20,],aes(x = reorder(Unigram, -freq), y = freq))
g<-g+geom_bar(stat = "identity",fill = I("lightblue"))
g<-g+geom_text(aes(label=Unigram.freq.df[1:20,]$freq),vjust=-0.2,size=3)
g<-g+xlab("Words")+ylab("Frequency")+ggtitle("20 Most Frequent Words")
g<-g+theme(axis.text.x = element_text(angle = 90, hjust = 1),
axis.text.y = element_text(angle = 0, hjust = 1))
g
Plotting a word cloud for the unigrams.
library(wordcloud2)
wordcloud2(data=Unigram.freq.df,size=0.5,minRotation = -pi/6, maxRotation = -pi/6, minSize = 10,
rotateRatio = 1,shape = "circle",color='random-dark')
text.bigram<-dfm(tokens_ngrams(sample.Data.corpus.tokens.no.stopwords,n=2,concatenator = " "),
tolower=TRUE,remove_padding = TRUE,
remove=bad.words,verbose=FALSE)
# Bigram word frequencies
bigram.freq<-topfeatures(text.bigram,100)
bigram.freq.df<-data.frame(bigram=names(bigram.freq),freq=bigram.freq)
Plotting a bar chart for bigram frequencies.
library(ggplot2)
# Bar plot of the 20 most frequent bigrams
g<-ggplot(bigram.freq.df[1:20,],aes(x = reorder(bigram, -freq), y = freq))
g<-g+geom_bar(stat = "identity",fill = I("lightblue"))
g<-g+geom_text(aes(label=bigram.freq.df[1:20,]$freq),vjust=-0.2,size=3)
g<-g+xlab("Bigrams")+ylab("Frequency")+ggtitle("20 Most Frequent Bigrams")
g<-g+theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(angle = 0, hjust = 1))
g
Plotting a word cloud for the bigrams.
wordcloud2(data=bigram.freq.df,size=0.4,minRotation = -pi/6, maxRotation = -pi/6, minSize = 10,
rotateRatio = 1,shape = "circle",
color='random-dark')
text.trigram<-dfm(tokens_ngrams(sample.Data.corpus.tokens.no.stopwords,n=3,concatenator = " "),
tolower=TRUE,remove_padding = TRUE,
remove=bad.words,verbose=FALSE)
# Trigram word frequencies
trigram.freq<-topfeatures(text.trigram,100)
trigram.freq.df<-data.frame(trigram=names(trigram.freq),freq=trigram.freq)
Plotting a bar chart for trigram frequencies.
library(ggplot2)
# Bar plot of the 20 most frequent trigrams
g<-ggplot(trigram.freq.df[1:20,],aes(x = reorder(trigram, -freq), y = freq))
g<-g+geom_bar(stat = "identity",fill = I("lightblue"))
g<-g+geom_text(aes(label=trigram.freq.df[1:20,]$freq),vjust=-0.2,size=3)
g<-g+xlab("Trigrams")+ylab("Frequency")+ggtitle("20 Most Frequent Trigrams")
g<-g+theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(angle = 0, hjust = 1))
g
Plotting a word cloud for the trigrams.
wordcloud2(data=trigram.freq.df,size=0.35,minRotation = -pi/6, maxRotation = -pi/6, minSize = 10,
rotateRatio = 1,shape = "circle",color='random-dark' )
Through exploring the data, we identified the 20 most frequently used words, bigrams, and trigrams, and visualized them in bar plots and word clouds. It’s worth noting that the most frequent words in the raw text were common stopwords such as “the,” “a,” and “I”; these were removed so the analysis could focus on more informative words.
The primary plan is to leverage the initial data analysis presented here to advance the development of the prediction algorithm required for the Shiny application. Following this exploratory analysis, one approach is to predict the next word using n-gram analysis combined with an appropriate model, such as a back-off model, all integrated into a Shiny app.
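As a rough illustration of the back-off idea, the sketch below uses only the top-100 trigram, bigram, and unigram frequency tables built above; predict.next.word is a hypothetical helper written for this illustration, not the final prediction model.
# Hypothetical sketch of a simple back-off lookup:
# try matching trigrams, then bigrams, then fall back to the most frequent unigram
predict.next.word<-function(phrase){
  words<-tolower(unlist(strsplit(phrase,"\\s+")))
  n<-length(words)
  if(n>=2){
    # trigrams whose first two words match the last two words typed
    prefix<-paste(words[n-1],words[n])
    hits<-trigram.freq.df[startsWith(as.character(trigram.freq.df$trigram),paste0(prefix," ")),]
    if(nrow(hits)>0) return(sub(".* ","",as.character(hits$trigram[which.max(hits$freq)])))
  }
  if(n>=1){
    # back off to bigrams whose first word matches the last word typed
    hits<-bigram.freq.df[startsWith(as.character(bigram.freq.df$bigram),paste0(words[n]," ")),]
    if(nrow(hits)>0) return(sub(".* ","",as.character(hits$bigram[which.max(hits$freq)])))
  }
  # fall back to the single most frequent unigram
  as.character(Unigram.freq.df$Unigram[1])
}
# Example usage
predict.next.word("happy new")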