This milestone report explores the HC Corpora dataset. It is part of the Data Science Capstone project offered through Coursera by Johns Hopkins University and SwiftKey.
Before the analysis starts, the required libraries are attached.
# Libraries
library(knitr)
library(tm)
library(ggplot2, warn.conflicts=F)
library(wordcloud)
library(ngram)
The following code downloads and extracts the data (when not already available). The US English corpus is loaded and some summary statistics are calculated. A random sample is then drawn from each of the three files, containing 0.5% of that file's lines. The sample text is cleaned by removing punctuation, numbers and extra white space, and by converting it to lower case.
# Download and unzip
dataset <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("SwiftKey.zip")) download.file(dataset, destfile="SwiftKey.zip")
if (!file.exists("final/")) unzip("Coursera-SwiftKey.zip")
# Load data and generate sample
folder <- "final/en_US/"
files <- dir(folder)
props <- data.frame(name=files, size=round(file.info(paste0(folder, files))$size/1024^2), lines=NA, words=NA, sample=NA)
textSample <- vector()
set.seed(1969)
for (corpus in files) {
  # Read the full file and record its line and word counts
  tekst <- readLines(paste0(folder, corpus), skipNul=T)
  props$lines[props$name==corpus] <- length(tekst)
  props$words[props$name==corpus] <- wordcount(tekst)
  # Draw a 0.5% random sample of the lines
  train <- round(0.005*length(tekst))
  props$sample[props$name==corpus] <- train
  textSample <- c(textSample, sample(tekst, train))
}
# Display results
props$lines <- prettyNum(props$lines, big.mark=",")
props$words <- prettyNum(props$words, big.mark=",")
props$sample <- prettyNum(props$sample, big.mark=",")
kable(props, col.names=c("File", "Size [MB]", "Lines", "Word count", "Sample size"), align=c("l", rep("r", 4)))
File | Size [MB] | Lines | Word count | Sample size |
---|---|---|---|---|
en_US.blogs.txt | 200 | 899,288 | 37,334,131 | 4,496 |
en_US.news.txt | 196 | 1,010,242 | 34,372,530 | 5,051 |
en_US.twitter.txt | 159 | 2,360,148 | 30,373,583 | 11,801 |
# Convert sample to tm corpus and clean data
textSample <- iconv(textSample, to="utf-8", sub="")
textSample <- textSample[!is.na(textSample)]
SampleCorpus <- Corpus(VectorSource(textSample))
SampleCorpus <- tm_map(SampleCorpus, removePunctuation)
SampleCorpus <- tm_map(SampleCorpus, removeNumbers)
SampleCorpus <- tm_map(SampleCorpus, stripWhitespace)
SampleCorpus <- tm_map(SampleCorpus, content_transformer(tolower))
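To check that the transformations behaved as expected, a few cleaned documents can be inspected; the check below is purely illustrative.
# Illustrative check of the cleaned sample
inspect(SampleCorpus[1:2])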
A Term-Document Matrix of single words is generated to visualise the most common terms. Tokenizers for two- and three-word n-grams (bigrams and trigrams) are defined as well; the corresponding matrices are left commented out here. The wordplot function below selects the most frequent terms and visualises them.
biGramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min=2, max=2))
triGramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min=3, max=3))
wordMatrix <- TermDocumentMatrix(SampleCorpus)
#biGramMatrix <- TermDocumentMatrix(SampleCorpus, control=list(tokenize=biGramTokenizer))
#triGramMatrix <- TermDocumentMatrix(SampleCorpus, control=list(tokenize=triGramTokenizer))
# Visualise sample
wordplot <- function(data, f) {
  # Plot the frequency of all terms that occur at least f times
  freqTerms <- findFreqTerms(data, lowfreq=f)
  termFrequency <- rowSums(as.matrix(data[freqTerms,]))
  termFrequency <- data.frame(ngram=names(termFrequency), frequency=termFrequency)
  ggplot(termFrequency, aes(x=reorder(ngram, frequency), y=frequency)) +
    geom_bar(stat="identity") + coord_flip() +
    xlab("Word") + ylab("Frequency") + labs(title="Top Words by Frequency")
}
wordplot(wordMatrix, 1200)
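Once the commented-out bigram (or trigram) matrix above is built, the same helper can be reused. The sketch below assumes the biGramMatrix line has been uncommented; note that recent tm versions may require a VCorpus (rather than the default SimpleCorpus) for the custom tokenizer to take effect, and the frequency threshold of 25 is only illustrative.
# Sketch: visualise the most frequent bigrams in the sample
# (VCorpus(VectorSource(textSample)) may be needed for the custom tokenizer to apply)
biGramMatrix <- TermDocumentMatrix(SampleCorpus, control=list(tokenize=biGramTokenizer))
wordplot(biGramMatrix, 25)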
This analysis illustrates the high frequency of stop-words in English text. They were not filtered out because a text prediction model needs them to predict the next word. The discovered frequency patterns can be used to build such a prediction model. For an analysis of meaning, rather than prediction, the stop-words should be removed, as sketched below.
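To illustrate that last point, stop-words could be stripped with tm before rebuilding the matrix. The sketch below uses hypothetical names (CleanCorpus, cleanMatrix) and an illustrative frequency threshold; it is not part of the prediction pipeline.
# Sketch: meaning-oriented analysis with English stop-words removed
# (not applied above, because the prediction model needs the stop-words)
CleanCorpus <- tm_map(SampleCorpus, removeWords, stopwords("english"))
CleanCorpus <- tm_map(CleanCorpus, stripWhitespace)
cleanMatrix <- TermDocumentMatrix(CleanCorpus)
wordplot(cleanMatrix, 200)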