A milestone not a millstone

This is a milestone report for the Data Science Capstone 003, which will discuss with you, dear reader, the following points:

Demonstrating I downloaded the data and successfully loaded it in.
Create a basic report of summary statistics about the data sets.
To explore strange new findings.
To boldly go on about my plans for creating a prediction algorithm and Shiny app.

Chapter 1, Downloading to Loading

In which our hero wrestles with processing time

I felt the best demonstration that I am successfully reading in the data is providing my code for Cleaning the data and generating my summary 5grams (from which I can then easily generate 4 through 1 grams). See appendix 1 for generating the raw mass of ngrams for each corpus, Appendix 2 for generating the ngrams.

Explaining what is going on:

I downloaded the zip compressed data from the Coursera site, which contained the English (supposedly U.S.) corpuses of data gathered from Twitter, Blogs, and News websites. I say supposedly U.S. because searching the corpus for local New Zealand place names not found in the United States revealed several items unrelated to the US, so hereafter I will just describe this as an English corpus.

I preformed initial text cleanup with the bash utilities tr and sed, for reasons of overall speed. On a Mac or Linux machine you should be able to execute commands like this by calling system in R, under Windows I believe the commands are available in git bash but am unsure of how to call them from R so they might need to be run independently.

The ultimate goal of the process was to prepare the data for rapid ngram creation (taking under 50 minutes to go from raw data to a “long” form of the data with one word (being a word or word like entry) per line, each sentence being marked by a set of consecutive placeholder values that are large enough that when ngrams are formed sentences do not connect. Most special characters we converted to the Δ symbol to indicate one or more unusual characters in sequence. Sentence subclause markers such as parentheses and brackets were removed. Then every word token was shift to a different line (with the multiline ϴ symbols marking sentence breaks) so that the entire file was organised into an easy to read list for making ngrams (collections of consecutive words)

Chapter 2, About

Summarise all the things

For readability answers have been dropped straight into the text as in-line code rather than interspaced code blocks.

Because of my data cleaning before reading the data in, my answers for how many lines and words (due to processing each token onto a separate line), and characters (due to deleting, combining, and adding multiple characters for sentence breaks) are, or at least should be, very different to other peoples:

Twitter (cleaned)

53846444 lines and words (used wc -l)
221239000 characters (used wc -c)
If working with 5 grams, there are 21377162 different ngrams, and 20182632 occur once
If working with 4 grams, there are 19373633 different ngrams, and 17707560 occur once
If working with 3 grams, there are 13964340 different ngrams, and 11847723 occur once
If working with 2 grams, there are 5554755 different ngrams, and 4110116 occur once

Blogs (cleaned)

50450552 lines and words (used wc -l)
241290397 characters (used wc -c)
If working with 5 grams, there are 30611432 different ngrams, and 29444227 occur once
If working with 4 grams, there are 27336146 different ngrams, and 25311354 occur once
If working with 3 grams, there are 18818936 different ngrams, and 15999268 occur once
If working with 2 grams, there are 6954586 different ngrams, and 5083850 occur once

News (cleaned)

46055021 lines and words (used wc -l)
221239000 characters (used wc -c)
If working with 5 grams, there are 28142698 different ngrams, and 26699639 occur once
If working with 4 grams, there are 25401811 different ngrams, and 23216798 occur once
If working with 3 grams, there are 18004986 different ngrams, and 15079959 occur once
If working with 2 grams, there are 6789238 different ngrams, and 4807002 occur once

Chapter 3, Findings

You may like to know

Twitter contains more words/lines but a smaller range of ngrams. This suggests that people on Twitter do not express things in as wide a range of ways as people in the other sources.
The average word length of Twitter is shorter than that of Blogs or News.
Blogs and News are very similar to each other

This can be seen in a plot of how frequent unique ngrams (those with a frequency of one) are compared to more frequent ngrams. While the data has a very long tail, it converges by 6 entries sharing the same ngram, so the x axis can be abbreviated.

The Twitter data is highlighted in blue- the fewer unique and almost unique entries show a different pattern to the similar news and blogs data.

Chapter 4, The future

off into the wild blue yonder

After determining the most valuable ngrams to keep for the purposes of prediction, the heart of my web app is having lists of what entries each word appears in. the intersection of the sets for the words is the position of the prediction, as demonstrated in the following code:

mysample <- read.table(header=TRUE, text="
n4 n3 n2 n1 frequent
j  g  g  e        1
h  i  d  d        1
a  c  g  d        1
g  f  f  a        1
h  j  i  e        1
h  h  h  e        1
f  j  a  j        3
j  d  f  a        1
b  c  f  i        2
c  c  d  a        1
j  a  j  c        2
d  a  a  j        1")

position <- 1:nrow(mysample)

n4list <- tapply(position,mysample$n4,c)
n3list <- tapply(position,mysample$n3,c)
n2list <- tapply(position,mysample$n2,c)
predictions <- mysample$n1
#example prediction "f j a..."
predictions[intersect(n4list[["f"]], intersect(n3list[["j"]],n2list[["a"]]))]

My reasons for this:

lists and number sets are highly efficient
it can be made more even faster by converting to environment variables, creating an O(1) hash lookup
if there are no exact matches it is easy to build up a set of partial matches by exploring different intersect combinations.

Appendix 1 : Processing code:

library(lubridate)

preptext <- function(which){
setwd("~/")
call <- paste("cp ~/en_US/",which," working1.txt",sep="")
system(call)
system("tr '[:upper:]' '[:lower:]' < working1.txt > exam.txt")
system("tr -d '\\000'  < exam.txt > better.txt")
system("sed -i -e 's/[123456780]/9/g' better.txt")
system("sed -i -e 's/[%$\\(\\)}{\"<>]//g' better.txt")
system("sed -i -e 's/\\[//g' better.txt")
system("sed -i -e 's/]//g' better.txt")
#converting any Δ so it can be a placeholder
system("sed -i -e 's/Δ/Ξ/g' better.txt")
#converting any ϴ so it can be a placeholder
system("sed -i -e 's/ϴ/Ξ/g' better.txt")
#end of sentances go to 4 (ngrams-1) ϴ
system("sed -i -e 's/[\\!?\\:]/ ϴ ϴ ϴ ϴ /g' better.txt")
#to keep st. as abreviation convert to stΔ
system("sed -i -e 's/ st\\. / stΔ /g' better.txt")
#convert . to  ϴ ϴ ϴ ϴ 
system("sed -i -e 's/\\./ ϴ ϴ ϴ ϴ /g' better.txt")
#convert stΔ to st.
system("sed -i -e 's/ stΔ / st. /g' better.txt")
#convert symbols I don't want to predict with to Ξ
system("sed -i -e \"s/[^[:alpha:]9#@.'[:space:]ϴ-]/Ξ/g\" better.txt")
#convert linebreaks to  ϴ ϴ ϴ ϴ 
system("tr -d '\\r' < better.txt > best.txt")
system("tr '\\n' ' ϴ ϴ ϴ ϴ '   < best.txt > final.txt")
system("sed -i -e 's/ΞΞ*/Ξ/g' final.txt")
system("sed -i -e 's/[:space:]\\+/ /g' final.txt")
system("sed -i -e 's/ /\\
/g' final.txt")
call <- paste("cp  final.txt ~/en_US_cleaned/",which,sep="")
system(call)
}


checkpoint1=now()
preptext("en_US.blogs.txt")
preptext("en_US.news.txt")
preptext("en_US.twitter.txt")

checkpoint2=now()

checkpoint2-checkpoint1

Appendix 2: ngram frequncy

library(data.table)
library(dplyr)
library(lubridate)

sumgrams <-function(myfile){
tokens <- readLines(paste("~/en_US_cleaned/",myfile,sep=""))
tokens[tokens == "ϴ"] <- NA

checkpoint1 <- now()
tokens <- c(character(5 - length(tokens) %% 5), tokens)
steps <- seq(from=5, to=length(tokens), by=5)
eoseq <- steps + (length(tokens) - max(steps))

p = c(tokens[eoseq], tokens[eoseq-1], tokens[eoseq-2], tokens[eoseq-3], tokens[eoseq-4])
n2 = c(tokens[eoseq-1], tokens[eoseq-2], tokens[eoseq-3], tokens[eoseq-4], (c(NA, tokens[eoseq]))[1:length(steps)])
n3 = c(tokens[eoseq-2], tokens[eoseq-3], tokens[eoseq-4], (c(NA, tokens[eoseq]))[1:length(steps)], (c(NA, tokens[eoseq-1]))[1:length(steps)])
n4 = c(tokens[eoseq-3], tokens[eoseq-4], (c(NA, tokens[eoseq]))[1:length(steps)], (c(NA, tokens[eoseq-1]))[1:length(steps)], (c(NA, tokens[eoseq-2]))[1:length(steps)])
n5 = c(tokens[eoseq-4], (c(NA, tokens[eoseq]))[1:length(steps)], (c(NA, tokens[eoseq-1]))[1:length(steps)], (c(NA, tokens[eoseq-2]))[1:length(steps)], (c(NA, tokens[eoseq-3]))[1:length(steps)])


dt <- data.table(n5,n4,n3,n2,p)
ngramfreq <- dt %>% group_by(p, n2, n3, n4, n5) %>% summarise(freq = n())
saveRDS(ngramfreq, file=paste("~/en_US_agg/",myfile,".RDS",sep=""), compress=FALSE)

checkpoint2 <- now()
print(checkpoint2 - checkpoint1)
}

sumgrams("en_US.twitter.txt")
sumgrams("en_US.blogs.txt")
sumgrams("en_US.news.txt")