Abstract

The goal of this report is to explore the datasets that will be used to develop a prediction algorithm for Natural Language Processing. Three datasets are provided, each containing a large amount of text obtained from online blogs, news articles and Twitter feeds. This report presents summary statistics for each dataset along with any interesting findings from the exploratory analysis. The final part of the report introduces our plan for creating a prediction algorithm and a Shiny app for our product.

Hardware & Software Information

  1. Operating System: Windows 7 Pro 64-bit
  2. R Software: Version 3.3.1
  3. Packages Used (see the loading sketch after this list):
    • NLP
    • tm
    • RWeka
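
For reproducibility, the packages listed above can be installed and loaded as shown below. This is a minimal sketch; package names are case-sensitive in R.

# Install any missing packages, then load them all
pkgs <- c("NLP", "tm", "RWeka")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))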

Method

Three datasets are provided, each containing a large amount of text obtained from online blogs, news articles and Twitter feeds. In this report, we will:

  1. Download the data and successfully load it into R.
  2. Generate summary statistics for each dataset.
  3. Generate histograms describing the most frequently used words and their frequencies in each dataset.
  4. Introduce a plan for creating a prediction algorithm and a Shiny app for the product.

Throughout the report, any interesting findings during the exploratory analysis will be recorded.

Task 1: Downloading and loading the datasets into R

The datasets are downloaded from the link provided in the course instructions.

# Download the zipped datasets and unzip them into the ./Data directory
setwd("~//R//CapstoneProject")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "./Data/Coursera-SwiftKey.zip")
unzip(zipfile = "./Data/Coursera-SwiftKey.zip", exdir = "./Data")

# Paths to the three English-language files
blogFile <- "./Data/final/en_US/en_US.blogs.txt"
newsFile <- "./Data/final/en_US/en_US.news.txt"
twitterFile <- "./Data/final/en_US/en_US.twitter.txt"

There are three files representing the three datasets for blogs, news and Twitter feeds. They contain large amounts of data, on the order of 200 MB per file. To obtain summary statistics without holding everything in memory, each file is read one line at a time.

Task 2: Generating summary statistics for each dataset

To better understand the datasets, we examine three key features of each dataset: the total number of words, the number of lines in the file, and the length of the longest line, measured as the largest number of words on a single line. The summary statistics for each dataset are presented in Table 1.

# Function to open a given file and compute its summary statistics:
# total word count, line count, and the maximum number of words on a line
summaryStats <- function(filepath) {
  wordCount <- 0
  lineCount <- 0
  maxLineLength <- 0
  con <- file(filepath, "r")
  while (TRUE) {
    line <- readLines(con, n = 1, skipNul = TRUE)
    if (length(line) == 0) {
      break
    }
    lineCount <- lineCount + 1
    # Approximate the number of words by counting runs of non-word characters
    lineLength <- sapply(gregexpr("\\W+", line), length) + 1
    wordCount <- wordCount + lineLength
    if (lineLength > maxLineLength) {
      maxLineLength <- lineLength
    }
  }
  close(con)
  return(c(wordCount, lineCount, maxLineLength))
}
blogStats <- summaryStats(blogFile)
newsStats <- summaryStats(newsFile)   # the warning below indicates the news file lacks a trailing newline
## Warning in readLines(con, n = 1, skipNul = TRUE): incomplete final line
## found on './Data/final/en_US/en_US.news.txt'
twitterStats <- summaryStats(twitterFile)

Table 1. Summary statistics of each dataset

Feature                 Blog Dataset   News Dataset   Twitter Dataset
Total words               39,386,844      2,836,204        32,874,052
Total lines                  899,288         77,259         2,360,148
Longest line (words)           6,852          1,522                63
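
As a side note, Table 1 can be generated directly from the three statistics vectors rather than typed by hand. The snippet below is a minimal sketch assuming the knitr package is available; the data frame name and column labels are illustrative.

library(knitr)

# Assemble the statistics vectors into a table, formatting counts with commas
statsTable <- data.frame(
  Feature = c("Total words", "Total lines", "Longest line (words)"),
  Blogs   = blogStats,
  News    = newsStats,
  Twitter = twitterStats
)
kable(statsTable, format.args = list(big.mark = ","))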

Task 3: Generating a histogram of the most frequently used words in the datasets

Before the most frequently used words can be determined, the datasets need to be loaded fully into R and preprocessed. Since we determined in Task 2 that the datasets are very large, we take a 1% random sample of each dataset as a representative subset and write each sample to a new file.

set.seed(1111)

# Read the full datasets, then keep a 1% random sample of the lines in each
readBlog <- readLines(blogFile, skipNul = TRUE)
readNews <- readLines(newsFile, skipNul = TRUE)
readTwitter <- readLines(twitterFile, skipNul = TRUE)
sampledBlog <- sample(readBlog, round(length(readBlog) * 0.01))
sampledNews <- sample(readNews, round(length(readNews) * 0.01))
sampledTwitter <- sample(readTwitter, round(length(readTwitter) * 0.01))

# Write the samples to a separate directory so they can be read as a corpus
dir.create("./Data/final/en_US/Sampled", showWarnings = FALSE)
sampledBlogFile <- "./Data/final/en_US/Sampled/en_US.blogs_Sampled.txt"
sampledNewsFile <- "./Data/final/en_US/Sampled/en_US.news_Sampled.txt"
sampledTwitterFile <- "./Data/final/en_US/Sampled/en_US.twitter_Sampled.txt"
write(sampledBlog, sampledBlogFile)
write(sampledNews, sampledNewsFile)
write(sampledTwitter, sampledTwitterFile)
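
The approach above loads each full file into memory before sampling. If memory becomes a constraint (see the first conclusion below), the sampling could instead be done while streaming each file line by line. The sketch below is an illustrative alternative, not the method used in this report; the function name, keep probability and chunk size are assumptions.

# Sketch: sample lines while reading the file in chunks, so the full dataset
# never has to be held in memory at once
sampleFileStreaming <- function(filepath, prob = 0.01, chunkSize = 10000) {
  con <- file(filepath, "r")
  kept <- character(0)
  repeat {
    lines <- readLines(con, n = chunkSize, skipNul = TRUE)
    if (length(lines) == 0) break
    # Keep each line independently with the given probability
    keep <- rbinom(length(lines), size = 1, prob = prob) == 1
    kept <- c(kept, lines[keep])
  }
  close(con)
  kept
}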

Next, the sampled datasets are loaded into R using the text mining (tm) package and preprocessed to remove numbers, punctuation, uppercase letters (everything is converted to lowercase), English stop words (e.g. a, the, and, but), extraneous white space, and common word endings such as 'es' and 'ing' (stemming).

library(tm)
## Loading required package: NLP
docs <- Corpus(DirSource("./Data/final/en_US/Sampled"), readerControl = list(language = "lat"))
summary(docs)    #check what documents were loaded into the corpus
##                           Length Class             Mode
## en_US.blogs_Sampled.txt   2      PlainTextDocument list
## en_US.news_Sampled.txt    2      PlainTextDocument list
## en_US.twitter_Sampled.txt 2      PlainTextDocument list
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))  
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument, language = "english")    #stemming removes common word endings

The next step is to create the document-term matrix (DTM), a matrix that records the number of occurrences of each word in each document of the corpus, effectively summarizing the documents' contents. The word frequencies are summed across all documents; the six most frequent words are listed below, and words with frequencies above 1000 are presented in Figure 1.

dtm <- DocumentTermMatrix(docs)
freqr <- colSums(as.matrix(dtm))   #sum frequencies of words across documents
ordr <- order(freqr, decreasing = TRUE)   #sort the word frequency list in descending order
freqr[head(ordr)]   #inspect the most frequently occurring words
## just  get like  one will love 
## 2576 2545 2425 2294 2235 1931
wf <- data.frame(term = names(freqr), occurrences = freqr)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
p <- ggplot(subset(wf, occurrences > 1000), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p

Conclusions

  1. One of the findings of this assignment is the difficulty of handling large input datasets in R. Moving forward, we need to consider the performance of the code and how to optimize the handling of the data.
  2. The next step in building the prediction model is to add functionality to recognize n-grams and to use it to train the model to predict words based on a number of n-gram sizes; a sketch of the n-gram step is shown below. The selection of the optimal n-gram size needs further consideration in order to balance prediction quality against the performance concerns raised in the first conclusion.
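
As an illustration of that next step, the sketch below shows one way bigram frequencies could be extracted from the preprocessed corpus docs created in Task 3, using the RWeka package listed in the software section. The tokenizer name and the n-gram size of 2 are assumptions for illustration, not the final design of the prediction model.

library(tm)
library(RWeka)

# Sketch: build a bigram term-document matrix from the preprocessed corpus
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2 <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))

# Sum bigram frequencies across documents and inspect the most common ones
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
head(freq2)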