This report documents my progress on my capstone project on Natural Language Processing. In it, I demonstrate that I have successfully downloaded and summarized the source data, sampled and cleaned a subset of it, and explored the most frequent one-, two-, and three-word n-grams.
There is more than one approach to this capstone project, as multiple packages have been developed for processing text.
My initial attempt involved using the tm package to clean the data and RWeka to create n-grams. While that combination did the work required, it was an arduous task putting it all together.
I soon discovered the Quanteda package, an incredibly powerful community-developed tool for natural language processing, which made this milestone project a much easier task than my previous method.
More information on the Quanteda package and its functions can be found at https://quanteda.io.
The necessary packages for text mining and NLP are loaded. The data that we will be using is downloaded and read in as lines.
#Loading the necessary packages
library(quanteda)
library(ggplot2)
library(stringi)
library(stopwords)
#Downloading the data (download.file requires an explicit destination file)
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists(basename(fileURL))){
  download.file(fileURL, destfile = basename(fileURL))
  unzip(basename(fileURL))
}
#Reading in the data. UTF-8 encoding is used to accommodate most types of characters seen in the text.
blogs <- readLines("./en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
A simple summary is computed to gather basic information about the data we are dealing with.
#Obtaining file size (in MB) for each source file
blogs.size <- paste(round(file.info("./en_US/en_US.blogs.txt")$size / 1024^2, 2), "MB")
news.size <- paste(round(file.info("./en_US/en_US.news.txt")$size / 1024^2, 2), "MB")
twitter.size <- paste(round(file.info("./en_US/en_US.twitter.txt")$size / 1024^2, 2), "MB")
#Obtaining word count for each source file
blogwordcount <- sum(stri_count_words(blogs))
newswordcount <- sum(stri_count_words(news))
twitterwordcount <- sum(stri_count_words(twitter))
#Obtaining number of lines for each source file
bloglinecount <- length(blogs)
newslinecount <- length(news)
twitterlinecount <- length(twitter)
size <- c(blogs.size, news.size, twitter.size)
wordcount <- c(blogwordcount, newswordcount, twitterwordcount)
linecount <- c(bloglinecount, newslinecount, twitterlinecount)
A summary of the file size, word count and number of lines is produced in a table.
summarytable <- matrix(c(size, wordcount, linecount), nrow = 3, byrow = FALSE)
colnames(summarytable) <- c("size", "wordcount", "linecount")
rownames(summarytable) <- c("blog", "news", "twitter")
summarytable
##         size        wordcount  linecount
## blog    "200.42 MB" "37546246" "899288"
## news    "196.28 MB" "2674536"  "77259"
## twitter "159.36 MB" "30093410" "2360148"
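Optionally, the same matrix can be rendered as a formatted table in the knitted report using knitr::kable (a small optional step, assuming the knitr package is installed; it is not used elsewhere in this report):
#Optional: render the summary matrix as a formatted table
knitr::kable(summarytable)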
For the purpose of exploration, only a small subset of the data is used. Random sampling is used to draw this subset. To resolve the issue of special and unique characters before applying the 'tolower' function, the files are converted from UTF-8 encoding to ASCII.
#Converting from UTF-8 encoding to ASCII
blogs <- iconv(blogs, 'UTF-8', 'ASCII')
blogs <- na.omit(blogs)
news <- iconv(news, 'UTF-8', 'ASCII')
news <- na.omit(news)
twitter <- iconv(twitter, 'UTF-8', 'ASCII')
twitter <- na.omit(twitter)
#Sampling 1000 lines from each source; a fixed seed keeps the sample reproducible
set.seed(1234)
data.sample <- c(sample(blogs, 1000),
                 sample(news, 1000),
                 sample(twitter, 1000))
Cleaning is necessary to extract useful and meaningful content from the text, regardless of its source.
Making use of the powerful Quanteda package, I tokenized the data subset into one-, two-, and three-word clusters (unigrams, bigrams, and trigrams).
The cleaning process removes URLs, special characters, word separators, English stopwords, Twitter characters, punctuation, and numbers; it also converts all text to lower case and splits hyphenated words into two.
Subsequently, I identify and sort the most frequently occurring n-grams using the topfeatures function.
Documentfeaturematrix <- function(x, y) {
  #Tokenize, stripping numbers, punctuation, symbols, separators,
  #Twitter characters, hyphens and URLs (quanteda v1 arguments)
  token <- tokens(char_tolower(x), remove_numbers = TRUE, remove_punct = TRUE,
                  remove_symbols = TRUE, remove_separators = TRUE,
                  remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE)
  #Stopwords are removed with tokens_remove(); tokens() has no 'remove' argument
  token <- tokens_remove(token, stopwords("english"))
  #Build the document-feature matrix from n-grams of length y
  dfm(tokens_ngrams(token, n = y))
}
Top30ngrams <- function(x) {
  df <- topfeatures(x, 30)
  data.frame(ngram = names(df), freq = df, row.names = NULL)
}
histogram <- function(df, x_axis, title) {
  ngramchart <- df[1:30, ]
  ggplot(ngramchart, aes(x = reorder(ngram, -freq), y = freq)) +
    geom_bar(stat = "identity", colour = "black") +
    labs(title = title, x = x_axis, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1))
}
Now, we are ready to explore our cleaned corpus!
Structuring my findings into document-feature matrices, I made use of the popular ggplot2 package to present my results as histograms.
dfm1 <- Documentfeaturematrix(data.sample, 1L)
unigram <- Top30ngrams(dfm1)
histogram(unigram, "unigram", "Top 30 Unigrams")
dfm2 <- Documentfeaturematrix(data.sample, 2L)
bigram <- Top30ngrams(dfm2)
histogram(bigram, "bigram", "Top 30 Bigrams")
dfm3 <- Documentfeaturematrix(data.sample, 3L)
trigram <- Top30ngrams(dfm3)
histogram(trigram, "trigram", "Top 30 Trigrams")
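Alongside the charts, the frequency tables themselves can be inspected directly; for example (the exact rows will vary with the random sample):
#Peek at the top trigrams as a data frame
head(trigram, 5)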
Exploratory analysis complete. This concludes my milestone report.
The next step is to produce a functioning Shiny app that can predict the next word from input phrases; a rough sketch of the idea follows.
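As an illustration only (not the final Shiny app), the n-gram frequency tables built above could drive prediction with a simple trigram-to-bigram backoff. The predictword function below and its backoff rule are my own illustrative assumptions. Quanteda joins n-gram tokens with an underscore (e.g. "one_of_the"), so the trailing token of a matching n-gram is the predicted word. Note that the tables in this report hold only the top 30 n-grams, so a real predictor would need far larger ones.
#A minimal next-word sketch, assuming the trigram/bigram tables from above.
#predictword() and its backoff rule are illustrative, not the final app.
predictword <- function(phrase, trigrams, bigrams) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    #Look for trigrams starting with the last two words of the phrase
    prefix <- paste0("^", words[n - 1], "_", words[n], "_")
    hits <- trigrams[grepl(prefix, trigrams$ngram), ]
    if (nrow(hits) > 0)
      return(sub(".*_", "", hits$ngram[which.max(hits$freq)]))
  }
  #Back off to bigrams starting with the last word
  prefix <- paste0("^", words[n], "_")
  hits <- bigrams[grepl(prefix, bigrams$ngram), ]
  if (nrow(hits) > 0)
    return(sub(".*_", "", hits$ngram[which.max(hits$freq)]))
  NA_character_
}
#Example call against the top-30 tables built in this report
predictword("one of the", trigram, bigram)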