Milestone Report

This report documents my progress on my capstone project on Natural Language Processing. In it, I aim to demonstrate that I have:

  1. Downloaded the data and successfully loaded it in for analysis.
  2. Created a basic report of summary statistics about the data set.
  3. Reported any interesting findings I have gathered so far.

Loading the Data

The necessary packages for text mining and NLP are loaded. The data that we will be using is downloaded and read in as lines.

#Loading the necessary packages
library(RWeka)
library(ggplot2)
library(tm)
library(stringi)
#Downloading the data
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists(basename(fileURL))){
    download.file(fileURL, destfile = basename(fileURL), mode = "wb")
    unzip(basename(fileURL))
}
#Reading in the data. UTF-8 encoding is specified to accommodate most types of characters seen in the text.
blogs <- readLines("./en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)


news <- readLines("./en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)


twitter <- readLines("./en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

A simple summary is produced to gather basic information about the data we are dealing with.

#Obtaining file size for separate source files

blogs.size <- paste(round(file.info("./en_US/en_US.blogs.txt")$size / 1024 ^ 2, 2), "MB")
news.size <- paste(round(file.info("./en_US/en_US.news.txt")$size / 1024 ^ 2, 2), "MB")
twitter.size <- paste(round(file.info("./en_US/en_US.twitter.txt")$size / 1024 ^ 2, 2), "MB")


#Obtaining word count for separate source files
blogwordcount <- sum(stri_count_words(blogs))
newswordcount <- sum(stri_count_words(news))
twitterwordcount <- sum(stri_count_words(twitter))


#Obtaining number of lines for separate source files
bloglinecount <- length(blogs)
newslinecount <- length(news)
twitterlinecount <- length(twitter)

size <- c(blogs.size, news.size, twitter.size)
wordcount <- c(blogwordcount, newswordcount, twitterwordcount)
linecount <- c(bloglinecount, newslinecount, twitterlinecount)

A summary of the file size, word count and number of lines is produced in a table.

summarytable <- matrix(c(size, wordcount, linecount), nrow = 3, byrow = FALSE)
colnames(summarytable) <- c("size", "wordcount", "linecount")
rownames(summarytable) <- c("blog", "news", "twitter")
summarytable
##         size        wordcount  linecount
## blog    "200.42 MB" "37546246" "899288" 
## news    "196.28 MB" "2674536"  "77259"  
## twitter "159.36 MB" "30093410" "2360148"
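
As an optional refinement, the same summary can be rendered as a neater table with knitr::kable. This is only a sketch and assumes the knitr package is installed; it is not part of the original analysis.

#Optional: nicer table rendering (assumes the knitr package is available)
library(knitr)
summary_df <- data.frame(size = size, wordcount = wordcount, linecount = linecount,
                         row.names = c("blog", "news", "twitter"))
kable(summary_df)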

Cleaning the Data

For the purpose of exploration, only a small subset of the corpus is used, selected by random sampling. To deal with special and unique characters that would otherwise cause problems for functions such as ‘tolower’, the text is converted from UTF-8 encoding to ASCII.

#Converting from UTF-8 encoding to ASCII 
blogs <- iconv(blogs, 'UTF-8', 'ASCII')
blogs <- na.omit(blogs)

news <- iconv(news, 'UTF-8', 'ASCII')
news <- na.omit(news)

twitter <- iconv(twitter, 'UTF-8', 'ASCII')
twitter <- na.omit(twitter)

#Sampling 1000 lines from each source. A seed is set so the sample is reproducible (the seed value itself is arbitrary).
set.seed(1234)
data.sample <- c(sample(blogs, 1000),
                 sample(news, 1000),
                 sample(twitter, 1000))

The data is organised into a corpus and cleaned with the help of the tm package. This cleaning step is necessary to extract useful and meaningful content regardless of the text source. It removes URLs, special characters such as Twitter handles, stopwords, unnecessary white space, punctuation and numbers, and converts all text to lower case.

corpus <- VCorpus(VectorSource(data.sample))
#Removing URLs
corpus <- tm_map(corpus, content_transformer(function(x, pattern) gsub(pattern, " ", x)), "(f|ht)tp(s?)://(.*)[.][a-z]+")
#Removing Twitter handles and other @-prefixed tokens
corpus <- tm_map(corpus, content_transformer(function(x, pattern) gsub(pattern, " ", x)), "@[^\\s]+")
#Converting to lower case
corpus <- tm_map(corpus, content_transformer(stringi::stri_trans_tolower))
#Removing English stopwords, punctuation, numbers and extra white space
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)

Now, we are ready to explore our cleaned corpus.
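
Before tokenizing, it can help to eyeball one or two cleaned documents to confirm that the transformations behaved as expected. A minimal check (the document indices are arbitrary):

#Optional sanity check: print the first two cleaned documents
writeLines(as.character(corpus[[1]]))
writeLines(as.character(corpus[[2]]))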

Data Exploration and Basic Visualization of Results

With the help of the RWeka package, I am able to find the most frequently occurring one-, two- and three-word clusters in the corpus. After structuring the results into a data frame, I use the ggplot2 package to plot them as histograms.

#Building term-document matrices for uni-, bi- and tri-grams

#The default tokenizer already yields single words (unigrams)
unigram <- TermDocumentMatrix(corpus)


#Creating a bigram tokenizer to recognise two-word clusters
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

#Creating a trigram tokenizer to recognise three-word clusters
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))
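
With a larger sample, the bigram and trigram matrices can grow quickly. One option, not applied in this report, is to drop extremely sparse terms with tm's removeSparseTerms; the sparsity threshold below is only an illustration.

#Optional (not used below): keep only n-grams that appear in more than ~0.1% of documents
bigram_dense <- removeSparseTerms(bigram, 0.999)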



# Structuring a TermDocumentMatrix into a frequency data frame
frequency_dataframe <- function(x){
  ngram <- sort(rowSums(as.matrix(x)), decreasing = TRUE)
  return(data.frame(ngram = names(ngram), frequency = ngram))
}
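
For a quick look at the raw counts before plotting, the helper can be called directly, for example on the unigram matrix:

#Example usage: the ten most frequent unigrams
head(frequency_dataframe(unigram), 10)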

#Setting the plotting function. Only the top 30 n-grams are plotted.
histogram <- function(tdm, x_axis, title) {
  df <- frequency_dataframe(tdm)
  df <- df[1:30, ]
  ggplot(df, aes(x = reorder(ngram, -frequency), y = frequency)) +
    geom_bar(stat = "identity", colour = "black") +
    labs(title = title, x = x_axis, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1))
}
  1. Unigram histogram
histogram(unigram,"unigram" ,"Top 30 Unigrams")

  2. Bigram histogram
histogram(bigram,"bigram" ,"Top 30 Bigrams")

  3. Trigram histogram
histogram(trigram,"trigram" ,"Top 30 Trigrams")

Exploratory analysis complete. This concludes my milestone report.