This milestone report explores the HC Corpora dataset. It is part of the Data Science Capstone project offered through Coursera by Johns Hopkins University and SwiftKey.
Before the analysis starts, the required libraries are attached.
# Libraries
library(knitr)
library(tm)
library(ggplot2, warn.conflicts=F)
library(wordcloud)
library(ngram)
The following code downloads and extracts the data (when not already available). The US English corpus is loaded and some summary statistics are calculated. A random sample is then drawn from each of the three files, containing 0.5% of that file's lines. The sample text is cleaned by removing punctuation, numbers and extra white space, and by converting it to lower case.
# Download and unzip
dataset <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("SwiftKey.zip")) download.file(dataset, destfile="SwiftKey.zip")
if (!file.exists("final/")) unzip("Coursera-SwiftKey.zip")
# Load data and generate sample
folder <- "final/en_US/"
files <- dir(folder)
props <- data.frame(name=files, size=round(file.info(paste0(folder, files))$size/1024^2), lines=NA, words=NA, sample=NA)
textSample <- vector()
set.seed(1969)
for (corpus in files) {
  # Read the full file and record its line and word counts
  tekst <- readLines(paste0(folder, corpus), skipNul=T)
  props$lines[props$name==corpus] <- length(tekst)
  props$words[props$name==corpus] <- wordcount(tekst)
  # Draw a 0.5% random sample of the lines
  train <- round(0.005*length(tekst))
  props$sample[props$name==corpus] <- train
  textSample <- c(textSample, sample(tekst, train))
}
# Display results
props$lines <- prettyNum(props$lines, big.mark=",")
props$words <- prettyNum(props$words, big.mark=",")
props$sample <- prettyNum(props$sample, big.mark=",")
kable(props, col.names=c("File", "Size [MB]", "Lines", "Word count", "Sample size"), align=c("l", rep("r", 4)))
File | Size [MB] | Lines | Word count | Sample size |
---|---|---|---|---|
en_US.blogs.txt | 200 | 899,288 | 37,334,131 | 4,496 |
en_US.news.txt | 196 | 1,010,242 | 34,372,530 | 5,051 |
en_US.twitter.txt | 159 | 2,360,148 | 30,373,583 | 11,801 |
# Convert sample to tm corpus and clean data
textSample <- iconv(textSample, to="utf-8", sub="")
textSample <- textSample[!is.na(textSample)]
SampleCorpus <- Corpus(VectorSource(textSample))
SampleCorpus <- tm_map(SampleCorpus, removePunctuation)
SampleCorpus <- tm_map(SampleCorpus, removeNumbers)
SampleCorpus <- tm_map(SampleCorpus, stripWhitespace)
SampleCorpus <- tm_map(SampleCorpus, content_transformer(tolower))
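To check that the transformations behaved as expected, a few cleaned documents can be inspected; the check below is purely illustrative.
# Illustrative check of the cleaned sample
inspect(SampleCorpus[1:2])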
A Term-Document Matrix of single words is generated to visualise the most common terms. Tokenizers for two- and three-word n-grams (bigrams and trigrams) are defined as well; the corresponding matrices are left commented out here. The wordplot function below selects the most frequent terms and visualises them.
biGramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min=2, max=2))
triGramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min=3, max=3))
wordMatrix <- TermDocumentMatrix(SampleCorpus)
#biGramMatrix <- TermDocumentMatrix(SampleCorpus, control=list(tokenize=biGramTokenizer))
#triGramMatrix <- TermDocumentMatrix(SampleCorpus, control=list(tokenize=triGramTokenizer))
# Visualise sample
wordplot <- function(data, f) {
  # Plot the frequency of all terms that occur at least f times
  freqTerms <- findFreqTerms(data, lowfreq=f)
  termFrequency <- rowSums(as.matrix(data[freqTerms,]))
  termFrequency <- data.frame(ngram=names(termFrequency), frequency=termFrequency)
  ggplot(termFrequency, aes(x=reorder(ngram, frequency), y=frequency)) +
    geom_bar(stat="identity") + coord_flip() +
    xlab("Word") + ylab("Frequency") + labs(title="Top Words by Frequency")
}
wordplot(wordMatrix, 1200)
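Once the commented-out bigram (or trigram) matrix above is built, the same helper can be reused. The sketch below assumes the biGramMatrix line has been uncommented; note that recent tm versions may require a VCorpus (rather than the default SimpleCorpus) for the custom tokenizer to take effect, and the frequency threshold of 25 is only illustrative.
# Sketch: visualise the most frequent bigrams in the sample
# (VCorpus(VectorSource(textSample)) may be needed for the custom tokenizer to apply)
biGramMatrix <- TermDocumentMatrix(SampleCorpus, control=list(tokenize=biGramTokenizer))
wordplot(biGramMatrix, 25)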
This analysis illustrates the high frequency of stop-words in English text. They were not filtered out because a text prediction model needs them to predict the next word. The discovered frequency patterns can be used to build such a prediction model. For an analysis of meaning, rather than prediction, the stop-words should be removed, as sketched below.
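To illustrate that last point, stop-words could be stripped with tm before rebuilding the matrix. The sketch below uses hypothetical names (CleanCorpus, cleanMatrix) and an illustrative frequency threshold; it is not part of the prediction pipeline.
# Sketch: meaning-oriented analysis with English stop-words removed
# (not applied above, because the prediction model needs the stop-words)
CleanCorpus <- tm_map(SampleCorpus, removeWords, stopwords("english"))
CleanCorpus <- tm_map(CleanCorpus, stripWhitespace)
cleanMatrix <- TermDocumentMatrix(CleanCorpus)
wordplot(cleanMatrix, 200)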