This report documents my progress on my capstone project on Natural Language Processing. In it, I demonstrate that I have successfully downloaded and summarized the source data, sampled and cleaned a subset of it, and explored the most frequent one-, two-, and three-word n-grams.
There is more than one approach to this capstone project, as multiple packages have been developed for processing text.
My initial attempt involved using the tm package to clean the data and RWeka to create n-grams. While that combination did the work required, it was an arduous task putting it all together.
I soon discovered the Quanteda package, an incredibly powerful community-developed tool for natural language processing, which made this milestone project a much easier task than my previous method.
More information on the Quanteda package and its functions can be found at https://quanteda.io.
The necessary packages for text mining and NLP are loaded. The data that we will be using is downloaded and read in as lines.
#Loading the necessary packages
library(quanteda)
library(ggplot2)
library(stringi)
library(stopwords)
#Downloading the data (download.file requires an explicit destination file)
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists(basename(fileURL))){
  download.file(fileURL, destfile = basename(fileURL))
  unzip(basename(fileURL))
}
#Reading in the data. UTF-8 encoding is used to accommodate most types of characters seen in the text.
blogs <- readLines("./en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
A simple summary is computed to gather basic information about the data we are dealing with.
#Obtaining file size (in MB) for each source file
blogs.size <- paste(round(file.info("./en_US/en_US.blogs.txt")$size / 1024^2, 2), "MB")
news.size <- paste(round(file.info("./en_US/en_US.news.txt")$size / 1024^2, 2), "MB")
twitter.size <- paste(round(file.info("./en_US/en_US.twitter.txt")$size / 1024^2, 2), "MB")
#Obtaining word count for each source file
blogwordcount <- sum(stri_count_words(blogs))
newswordcount <- sum(stri_count_words(news))
twitterwordcount <- sum(stri_count_words(twitter))
#Obtaining number of lines for each source file
bloglinecount <- length(blogs)
newslinecount <- length(news)
twitterlinecount <- length(twitter)
size <- c(blogs.size, news.size, twitter.size)
wordcount <- c(blogwordcount, newswordcount, twitterwordcount)
linecount <- c(bloglinecount, newslinecount, twitterlinecount)
A summary of the file size, word count and number of lines is produced in a table.
summarytable <- matrix(c(size, wordcount, linecount), nrow = 3, byrow = FALSE)
colnames(summarytable) <- c("size", "wordcount", "linecount")
rownames(summarytable) <- c("blog", "news", "twitter")
summarytable
##         size        wordcount  linecount
## blog    "200.42 MB" "37546246" "899288"
## news    "196.28 MB" "2674536"  "77259"
## twitter "159.36 MB" "30093410" "2360148"
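Optionally, the same matrix can be rendered as a formatted table in the knitted report using knitr::kable (a small optional step, assuming the knitr package is installed; it is not used elsewhere in this report):
#Optional: render the summary matrix as a formatted table
knitr::kable(summarytable)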
For the purpose of exploration, only a small subset of the data is used. Random sampling is used to draw this subset. To resolve the issue of special and unique characters before applying the 'tolower' function, the files are converted from UTF-8 encoding to ASCII.
#Converting from UTF-8 encoding to ASCII
blogs <- iconv(blogs, 'UTF-8', 'ASCII')
blogs <- na.omit(blogs)
news <- iconv(news, 'UTF-8', 'ASCII')
news <- na.omit(news)
twitter <- iconv(twitter, 'UTF-8', 'ASCII')
twitter <- na.omit(twitter)
#Sampling 1000 lines from each source; a fixed seed keeps the sample reproducible
set.seed(1234)
data.sample <- c(sample(blogs, 1000),
                 sample(news, 1000),
                 sample(twitter, 1000))
Cleaning is necessary to extract useful and meaningful content from the text, regardless of its source.
Making use of the powerful Quanteda package, I tokenized the data subset into one-, two-, and three-word clusters (unigrams, bigrams, and trigrams).
The cleaning process removes URLs, special characters, word separators, English stopwords, Twitter characters, punctuation, and numbers; it also converts all text to lower case and splits hyphenated words into two.
Subsequently, I identify and sort the most frequently occurring n-grams using the topfeatures function.
Documentfeaturematrix <- function(x, y) {
  #Tokenize, stripping numbers, punctuation, symbols, separators,
  #Twitter characters, hyphens and URLs (quanteda v1 arguments)
  token <- tokens(char_tolower(x), remove_numbers = TRUE, remove_punct = TRUE,
                  remove_symbols = TRUE, remove_separators = TRUE,
                  remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE)
  #Stopwords are removed with tokens_remove(); tokens() has no 'remove' argument
  token <- tokens_remove(token, stopwords("english"))
  #Build the document-feature matrix from n-grams of length y
  dfm(tokens_ngrams(token, n = y))
}
Top30ngrams <- function(x) {
  df <- topfeatures(x, 30)
  data.frame(ngram = names(df), freq = df, row.names = NULL)
}
histogram <- function(df, x_axis, title) {
  ngramchart <- df[1:30, ]
  ggplot(ngramchart, aes(x = reorder(ngram, -freq), y = freq)) +
    geom_bar(stat = "identity", colour = "black") +
    labs(title = title, x = x_axis, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1))
}
Now, we are ready to explore our cleaned corpus!
Structuring my findings into document-feature matrices, I made use of the popular ggplot2 package to present my results as histograms.
dfm1 <- Documentfeaturematrix(data.sample, 1L)
unigram <- Top30ngrams(dfm1)
histogram(unigram, "unigram", "Top 30 Unigrams")
dfm2 <- Documentfeaturematrix(data.sample, 2L)
bigram <- Top30ngrams(dfm2)
histogram(bigram, "bigram", "Top 30 Bigrams")
dfm3 <- Documentfeaturematrix(data.sample, 3L)
trigram <- Top30ngrams(dfm3)
histogram(trigram, "trigram", "Top 30 Trigrams")
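Alongside the charts, the frequency tables themselves can be inspected directly; for example (the exact rows will vary with the random sample):
#Peek at the top trigrams as a data frame
head(trigram, 5)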
Exploratory analysis complete. This concludes my milestone report.
The next step is to produce a functioning Shiny app that can predict the next word from input phrases; a rough sketch of the idea follows.
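As an illustration only (not the final Shiny app), the n-gram frequency tables built above could drive prediction with a simple trigram-to-bigram backoff. The predictword function below and its backoff rule are my own illustrative assumptions. Quanteda joins n-gram tokens with an underscore (e.g. "one_of_the"), so the trailing token of a matching n-gram is the predicted word. Note that the tables in this report hold only the top 30 n-grams, so a real predictor would need far larger ones.
#A minimal next-word sketch, assuming the trigram/bigram tables from above.
#predictword() and its backoff rule are illustrative, not the final app.
predictword <- function(phrase, trigrams, bigrams) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    #Look for trigrams starting with the last two words of the phrase
    prefix <- paste0("^", words[n - 1], "_", words[n], "_")
    hits <- trigrams[grepl(prefix, trigrams$ngram), ]
    if (nrow(hits) > 0)
      return(sub(".*_", "", hits$ngram[which.max(hits$freq)]))
  }
  #Back off to bigrams starting with the last word
  prefix <- paste0("^", words[n], "_")
  hits <- bigrams[grepl(prefix, bigrams$ngram), ]
  if (nrow(hits) > 0)
    return(sub(".*_", "", hits$ngram[which.max(hits$freq)]))
  NA_character_
}
#Example call against the top-30 tables built in this report
predictword("one of the", trigram, bigram)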