Introduction

This milestone report explores the HC Corpora dataset as part of the Data Science Capstone project offered through Coursera by Johns Hopkins University and SwiftKey.

Before starting the analysis, the required libraries are attached.

# Libraries
library(knitr)
library(tm)
library(ggplot2, warn.conflicts=F)
library(wordcloud)
library(ngram)

Extract, Load and Transform data

The following code downloads and extracts the data (when not already available). The US English corpus is loaded and some summary statistics are calculated. The next section draws a random sample of roughly 0.5% of the lines in each of the three files. The sample text is cleaned by removing punctuation, numbers and extra white space, and by converting it to lower case.

# Download and unzip
dataset <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("SwiftKey.zip")) download.file(dataset, destfile="SwiftKey.zip")
if (!file.exists("final/")) unzip("Coursera-SwiftKey.zip")

# Load data and generate sample
folder <- "final/en_US/"
files <- dir(folder)

props <- data.frame(name=files,
                    size=round(file.info(paste0(folder, files))$size/1024^2),
                    lines=NA, words=NA, sample=NA)
textSample <- vector()

set.seed(1969)
for (corpus in files) {
  tekst <- readLines(paste0(folder, corpus), skipNul=T)
  props$lines[props$name==corpus] <- length(tekst)
  props$words[props$name==corpus] <- wordcount(tekst)
  train <- round(0.005*length(tekst))
  props$sample[props$name==corpus] <- train
  textSample <- c(textSample, sample(tekst, train))
}

# Display results
props$lines <- prettyNum(props$lines, big.mark=",")
props$words <- prettyNum(props$words, big.mark=",")
props$sample <- prettyNum(props$sample, big.mark=",")
kable(props, col.names=c("File", "Size [MB]", "Lines", "Word count", "Sample size"), align=c("l", rep("r", 4)))
File               Size [MB]      Lines   Word count   Sample size
en_US.blogs.txt          200    899,288   37,334,131         4,496
en_US.news.txt           196  1,010,242   34,372,530         5,051
en_US.twitter.txt        159  2,360,148   30,373,583        11,801

# Convert sample to tm corpus and clean data
textSample <- iconv(textSample, to="utf-8", sub="")
textSample <- textSample[!is.na(textSample)]
SampleCorpus <- Corpus(VectorSource(textSample))
SampleCorpus <- tm_map(SampleCorpus, removePunctuation)
SampleCorpus <- tm_map(SampleCorpus, removeNumbers)
SampleCorpus <- tm_map(SampleCorpus, stripWhitespace)
SampleCorpus <- tm_map(SampleCorpus, content_transformer(tolower))
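
A few documents of the cleaned sample can be inspected as a quick check that punctuation, numbers and extra white space were indeed removed; this is an optional step and the indices used here are arbitrary.

# Inspect a few cleaned sample documents (illustrative check)
inspect(SampleCorpus[1:3])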

Visualise the dataset

Term-document matrices are generated to visualise the most common terms. Tokenizers for two- and three-word sequences (n-grams) are defined as well, although only the single-word matrix is built and visualised below. The wordplot function selects the most frequent terms and plots their frequencies.

biGramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min=2, max=2))
triGramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min=3, max=3))

wordMatrix <- TermDocumentMatrix(SampleCorpus)
#biGramMatrix <- TermDocumentMatrix(SampleCorpus, control=list(tokenize=biGramTokenizer))
#triGramMatrix <- TermDocumentMatrix(SampleCorpus, control=list(tokenize=triGramTokenizer))

# Visualise sample
wordplot <- function(data,f) {
  freqTerms <- findFreqTerms(data, lowfreq=f)
  termFrequency <- rowSums(as.matrix(data[freqTerms,]))
  termFrequency <- data.frame(ngram=names(termFrequency), frequency=termFrequency)
  ggplot(termFrequency, aes(x=reorder(ngram, frequency), y=frequency)) + 
    geom_bar(stat = "identity") +  coord_flip() +
    xlab("Word") + ylab("Frequency") + labs(title = "Top Words by Frequency") 
}

Single words

wordplot(wordMatrix, 1200)
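
Beyond single words, the same helper could visualise the bi-gram and tri-gram matrices once the commented-out lines above are run; the frequency thresholds below are illustrative guesses for a 0.5% sample and would need tuning.

# Visualise the most frequent bi-grams and tri-grams (illustrative thresholds;
# requires biGramMatrix and triGramMatrix from the commented-out lines above)
wordplot(biGramMatrix, 100)
wordplot(triGramMatrix, 25)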

Conclusion

This analysis visualises the high frequency of stop-words in English text. They were not filtered out because a text prediction model needs them to predict the next word. The discovered patterns can be used to build such a prediction model. When analysing a corpus for meaning, however, stop-words should be removed.
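
As a sketch of that last point, using the same tm pipeline (CleanCorpus and cleanMatrix are names introduced here only for illustration), stop-words could be removed with the standard English stop-word list before building a term-document matrix:

# Remove common English stop-words for a meaning-oriented analysis (sketch)
CleanCorpus <- tm_map(SampleCorpus, removeWords, stopwords("english"))
cleanMatrix <- TermDocumentMatrix(CleanCorpus)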