This report documents my progress on my capstone project on Natural Language Processing. In it, I demonstrate that I have successfully downloaded and read in the text data, summarised its basic properties, cleaned and sampled a corpus, and explored the most frequent one-, two- and three-word clusters.
The necessary packages for text mining and NLP are loaded. The data that we will be using is downloaded and read in as lines.
#Loading the necessary packages
library(RWeka)
library(ggplot2)
library(tm)
library(stringi)
#Downloading the data
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists(basename(fileURL))){
  download.file(fileURL, destfile = basename(fileURL))
  unzip(basename(fileURL))
}
#Reading in the data. UTF-8 encoding is used to accommodate most of the character types seen in the text.
blogs_con <- file("./en_US/en_US.blogs.txt")
blogs <- readLines(blogs_con, encoding = "UTF-8", skipNul = TRUE)
close(blogs_con)
news_con <- file("./en_US/en_US.news.txt")
news <- readLines(news_con, encoding = "UTF-8", skipNul = TRUE)
close(news_con)
twitter_con <- file("./en_US/en_US.twitter.txt")
twitter <- readLines(twitter_con, encoding = "UTF-8", skipNul = TRUE)
close(twitter_con)
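Note: on some platforms (notably Windows), readLines() can stop early on the news file because of an embedded control character, which would make its line count look low. A possible workaround, not part of the original run, is to open that connection in binary mode:
#Possible workaround (assumption: an embedded control character truncates the
#text-mode read of the news file on some platforms): open it in binary mode instead
news_con <- file("./en_US/en_US.news.txt", open = "rb")
news <- readLines(news_con, encoding = "UTF-8", skipNul = TRUE)
close(news_con)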
A simple summary is produced to gather basic information on the three source files.
#Obtaining file size for separate source files
blogs.size <- paste(file.info("./en_US/en_US.blogs.txt")$size / 1024 ^ 2,"MB")
news.size <- paste(file.info("./en_US/en_US.news.txt")$size / 1024 ^ 2,"MB")
twitter.size <- paste(file.info("./en_US/en_US.twitter.txt")$size / 1024 ^ 2,"MB")
#Obtaining word count for separate source files
blogwordcount<-sum(stri_count_words(blogs))
newswordcount<-sum(stri_count_words(news))
twitterwordcount<-sum(stri_count_words(twitter))
#Obtaining number of lines for separate source files
bloglinecount<-length(blogs)
newslinecount<-length(news)
twitterlinecount<-length(twitter)
size<- c(blogs.size,news.size,twitter.size)
wordcount<-c(blogwordcount,newswordcount,twitterwordcount)
linecount<-c(bloglinecount,newslinecount,twitterlinecount)
A summary of the file size, word count and number of lines is produced in a table.
summarytable<-matrix(c(size,wordcount,linecount),nrow =3,byrow = FALSE)
colnames(summarytable)<- c("size","wordcount","linecount")
rownames(summarytable)<-c("blog","news","twitter")
summarytable
## size wordcount linecount
## blog "200.424207687378 MB" "37546246" "899288"
## news "196.277512550354 MB" "2674536" "77259"
## twitter "159.364068984985 MB" "30093410" "2360148"
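As an optional presentation tweak, not part of the original table, the same summary can be built as a data frame with the file sizes rounded to one decimal place:
#Optional: the same summary as a data frame with rounded sizes
summary_df <- data.frame(size_MB = round(c(file.info("./en_US/en_US.blogs.txt")$size,
                                           file.info("./en_US/en_US.news.txt")$size,
                                           file.info("./en_US/en_US.twitter.txt")$size) / 1024^2, 1),
                         wordcount = wordcount,
                         linecount = linecount,
                         row.names = c("blog", "news", "twitter"))
summary_df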
For the purpose of exploration, only a small subset of the corpus is used. Random sampling is used to draw this subset. To avoid problems with special and unusual characters when lower-casing the text, the files are converted from UTF-8 encoding to ASCII.
#Converting from UTF-8 encoding to ASCII. Lines containing characters that cannot be converted become NA and are removed.
blogs <- iconv(blogs, 'UTF-8', 'ASCII')
blogs <- na.omit(blogs)
news <- iconv(news, 'UTF-8', 'ASCII')
news <- na.omit(news)
twitter <- iconv(twitter, 'UTF-8', 'ASCII')
twitter <- na.omit(twitter)
#Sampling 1000 lines from each source. A seed is set so the sample is reproducible.
set.seed(1234)
data.sample <- c(sample(blogs, 1000),
                 sample(news, 1000),
                 sample(twitter, 1000))
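As a quick sanity check, not part of the original analysis, the earlier line counts can be compared against the converted vectors to see how many lines the ASCII conversion dropped:
#Sanity check: lines dropped by the UTF-8 to ASCII conversion
dropped <- c(blogs   = bloglinecount   - length(blogs),
             news    = newslinecount    - length(news),
             twitter = twitterlinecount - length(twitter))
dropped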
The data is organised into a corpus and cleaned with the help of the tm package. Cleaning is necessary to extract useful and meaningful content from the text, regardless of its source. The cleaning process removes URLs, Twitter handles, stopwords, unnecessary white space, punctuation and numbers, and converts all text to lower case.
corpus <- VCorpus(VectorSource(data.sample))
#Remove URLs
corpus <- tm_map(corpus, content_transformer(function(x, pattern) gsub(pattern, " ", x)), "(f|ht)tp(s?)://(.*)[.][a-z]+")
#Remove Twitter handles and other @-prefixed tokens
corpus <- tm_map(corpus, content_transformer(function(x, pattern) gsub(pattern, " ", x)), "@[^\\s]+")
#Convert all text to lower case
corpus <- tm_map(corpus, content_transformer(stringi::stri_trans_tolower))
#Remove English stopwords, punctuation and numbers
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
#Collapse extra white space and store as plain text documents
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
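A quick look at one or two cleaned documents helps confirm that the transformations behaved as expected (a small sanity check, not part of the original report):
#Inspect a couple of cleaned documents
as.character(corpus[[1]])
as.character(corpus[[2]])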
Now, we are ready to explore our cleaned corpus!
With the help of the RWeka package, I am able to find the most frequently occurring one-, two- and three-word clusters (n-grams) in the corpus. After structuring my findings into data frames, I use the popular ggplot2 package to plot the results as histograms.
#Building term-document matrices for uni-, bi- and tri-grams.
#The default tokenizer handles single words (unigrams).
unigram <- TermDocumentMatrix(corpus)
#Bigram tokenizing function to recognise two-word clusters
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
#Trigram tokenizing function to recognise three-word clusters
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))
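Before plotting, tm's findFreqTerms() offers a quick way to list the n-grams that occur at least a given number of times (an optional peek, not in the original report):
#Optional peek: n-grams appearing at least 10 times in the sample
findFreqTerms(unigram, lowfreq = 10)
findFreqTerms(bigram, lowfreq = 10)
findFreqTerms(trigram, lowfreq = 10)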
#Structure a TermDocumentMatrix as a data frame of n-grams and their frequencies
frequency_dataframe <- function(x){
  ngram <- sort(rowSums(as.matrix(x)), decreasing = TRUE)
  return(data.frame(ngram = names(ngram), frequency = ngram))
}
#Plotting function. Only the top 30 n-grams are plotted.
histogram <- function(tdm, x_axis, title) {
  df <- frequency_dataframe(tdm)
  df <- df[1:30, ]
  ggplot(df, aes(x = reorder(ngram, -frequency), y = frequency)) +
    geom_bar(stat = "identity", colour = "black") +
    labs(title = title, x = x_axis, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1))
}
histogram(unigram,"unigram" ,"Top 30 Unigrams")
histogram(bigram,"bigram" ,"Top 30 Bigrams")
histogram(trigram,"trigram" ,"Top 30 Trigrams")
The exploratory analysis is complete. This concludes my milestone report.