The goal of this report is to explore the datasets that will be used to develop a prediction algorithm for Natural Language Processing. Three datasets are provided, each containing a large amount of text obtained from online blogs, news articles, and Twitter feeds. In this report, we will:

- compute summary statistics for each dataset (total words, total lines, and the longest line);
- take a small random sample of each dataset, preprocess it, and identify the most frequently used words;
- record any interesting findings made during the exploratory analysis.

The final part of the report will introduce our plan for creating a prediction algorithm and Shiny app for our product.
The datasets are downloaded from the link provided in the course instructions.
setwd("~//R//CapstoneProject")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url,destfile = "./Data/Coursera-SwiftKey.zip")
unzip(zipfile="./Data/Coursera-SwiftKey.zip",exdir="./Data")
blogFile <- "./Data/final/en_US/en_US.blogs.txt"
newsFile <- "./Data/final/en_US/en_US.news.txt"
twitterFile <- "./Data/final/en_US/en_US.twitter.txt"
There are three files, one for each dataset: blogs, news, and Twitter feeds. They contain large amounts of data, on the order of 200 MB per file. To obtain summary statistics without loading the full datasets into memory, the files are read one line at a time.
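As a quick check of the file sizes, the short sketch below (an illustrative addition, reusing the file paths defined above) reports the size of each raw file in megabytes.
#Illustrative check (not part of the original analysis): size of each raw file in MB
round(file.info(c(blogFile, newsFile, twitterFile))$size / 1024^2, 1)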
To better understand the datasets, we examine three key features of each: the total number of words, the number of lines in the file, and the length of the longest line, where line length is measured in words (so the longest line is the line with the largest number of words). The summary statistics for each dataset are presented in Table 1.
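Word counts are approximated by counting the runs of non-word characters on a line and adding one, as the small illustration below shows (the example sentence is arbitrary and added only for clarity).
#Illustration of the word-count approximation used in summaryStats() below:
#"\\W+" matches each run of non-word characters (the gaps between words),
#so the number of matches plus one approximates the number of words on the line.
sapply(gregexpr("\\W+", "this line contains five words"), length) + 1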
#Create a function to open a given file and compute its summary statistics:
#total word count, total line count, and the word count of the longest line
summaryStats = function(filepath) {
  wordCount = 0
  lineCount = 0
  maxLineLength = 0
  con = file(filepath, "r")
  while ( TRUE ) {
    #read one line at a time so the full file is never held in memory
    line = readLines(con, n = 1, skipNul = TRUE)
    if ( length(line) == 0 ) {
      break
    }
    lineCount <- lineCount + 1
    #approximate word count: number of non-word-character runs plus one
    lineLength <- sapply(gregexpr("\\W+", line), length) + 1
    wordCount <- wordCount + lineLength
    if ( lineLength > maxLineLength ) {
      maxLineLength <- lineLength
    }
  }
  close(con)
  return(c(wordCount, lineCount, maxLineLength))
}
blogStats <- summaryStats(blogFile)
newsStats <- summaryStats(newsFile)
## Warning in readLines(con, n = 1, skipNul = TRUE): incomplete final line
## found on './Data/final/en_US/en_US.news.txt'
twitterStats <- summaryStats(twitterFile)
Table 1. Summary statistics of each dataset
| Dataset Features | Blog Dataset | News Dataset | Twitter Dataset |
|---|---|---|---|
| Total words | 39,386,844 | 2,836,204 | 32,874,052 |
| Total lines | 899,288 | 77,259 | 2,360,148 |
| Longest line (words) | 6,852 | 1,522 | 63 |
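For reference, the raw counts returned by summaryStats() can be printed without scientific notation; the line below is a small formatting sketch added for illustration, using the blogStats vector computed above.
#Print the blog statistics with thousands separators instead of scientific notation
format(blogStats, big.mark = ",", scientific = FALSE)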
Before the most frequently used words can be determined, the datasets need to be loaded fully into R and preprocessed. Since we determined in Task 2 that the datasets are very large, we take a 1% random sample of each dataset as a representative subset and write it out to a new file.
set.seed(1111)
readBlog <- readLines(blogFile)
readNews <- readLines(newsFile)
readTwitter <- readLines(twitterFile)
sampledBlog <- sample(readBlog,length(readBlog)*0.01)
sampledNews <- sample(readNews,length(readNews)*0.01)
sampledTwitter <- sample(readTwitter,length(readTwitter)*0.01)
sampledBlogFile <- "./Data/final/en_US/Sampled/en_US.blogs_Sampled.txt"
sampledNewsFile <- "./Data/final/en_US/Sampled/en_US.news_Sampled.txt"
sampledTwitterFile <- "./Data/final/en_US/Sampled/en_US.twitter_Sampled.txt"
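#Note (an addition, not in the original code): write() will fail if the
#Sampled folder does not already exist, so it can be created first.
dir.create("./Data/final/en_US/Sampled", recursive = TRUE, showWarnings = FALSE)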
write(sampledBlog,sampledBlogFile)
write(sampledNews,sampledNewsFile)
write(sampledTwitter,sampledTwitterFile)
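As a quick sanity check (an illustrative addition), the number of lines retained in each 1% sample can be compared against the line counts in Table 1.
#Number of lines in each 1% sample
sapply(list(blog = sampledBlog, news = sampledNews, twitter = sampledTwitter), length)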
Next, the sampled datasets are loaded into R using the text mining (tm) package and preprocessed: numbers and punctuation are removed, all text is converted to lowercase, English stop words (e.g. a, the, and, but) and extra whitespace are stripped out, and words are stemmed to remove common endings (like ‘es’ and ‘ing’).
library(tm)
## Loading required package: NLP
docs <-Corpus(DirSource("./Data/final/en_US/Sampled"), readerControl = list(language="lat"))
summary(docs) #check what documents were loaded into the corpus
## Length Class Mode
## en_US.blogs_Sampled.txt 2 PlainTextDocument list
## en_US.news_Sampled.txt 2 PlainTextDocument list
## en_US.twitter_Sampled.txt 2 PlainTextDocument list
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs , stripWhitespace)
docs <- tm_map(docs, stemDocument, language = "english") #stemming removes common word endings
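To verify the effect of these transformations, a few lines of the cleaned text can be inspected; the snippet below is an illustrative check (assuming docs[[1]] is the sampled blog document, per the summary(docs) output above), and its exact output depends on the random sample.
#Preview the first few lines of the cleaned blog sample
head(content(docs[[1]]), 3)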
The next step is to create the document-term matrix (DTM), a matrix that lists the number of occurrences of every word in each document of the corpus, which effectively summarises the documents. The word frequencies are summed across all documents; the six most frequent words are listed below, and words with frequencies above 1000 are presented in Figure 1.
dtm <-DocumentTermMatrix(docs)
freqr <- colSums(as.matrix(dtm)) #sum frequencies of words across documents
ordr <- order(freqr,decreasing=TRUE) #sort word frequency list by descending order
freqr[head(ordr)] #inspect the most frequently occurring words
## just get like one will love
## 2576 2545 2425 2294 2235 1931
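The terms plotted in Figure 1 can also be listed directly with tm's findFreqTerms(); the call below is an illustrative cross-check that returns every term appearing at least 1000 times across the corpus.
#List all terms with a total frequency of at least 1000
findFreqTerms(dtm, lowfreq = 1000)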
wf=data.frame(term=names(freqr),occurrences=freqr)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
p <- ggplot(subset(wf, freqr>1000), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p