This is a milestone report for the Data Science Capstone 003, which will discuss with you, dear reader, the following points:
In which our hero wrestles with processing time
I felt the best demonstration that I am successfully reading in the data is to provide my code for cleaning the data and generating my summary 5-grams (from which I can then easily generate the 4-grams through 1-grams; a sketch of that collapse follows Appendix 2). See Appendix 1 for producing the cleaned, long-format text for each corpus and Appendix 2 for generating the summary ngrams.
Explaining what is going on:
I downloaded the zip-compressed data from the Coursera site, which contained the English (supposedly U.S.) corpora of data gathered from Twitter, blogs, and news websites. I say supposedly U.S. because searching the corpus for local New Zealand place names not found in the United States turned up several matches unrelated to the US, so hereafter I will just describe this as an English corpus.
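If you would like to check this yourself, a spot-check along these lines can be run from R; the place name below is only an illustrative example, not the exact search I used.

#count case-insensitive matches for a New Zealand place name in the Twitter corpus
#("whangarei" is just an example query; any locally specific place name will do)
system("grep -c -i 'whangarei' ~/en_US/en_US.twitter.txt")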
I performed initial text cleanup with the bash utilities tr and sed, for reasons of overall speed. On a Mac or Linux machine you should be able to execute commands like this by calling system() in R; under Windows I believe the commands are available in Git Bash, but I am unsure how to call them from R, so they might need to be run independently.
The ultimate goal of the process was to prepare the data for rapid ngram creation: taking under 50 minutes to go from the raw data to a “long” form with one token (a word or word-like entry) per line, and with each sentence break marked by four consecutive ϴ placeholders, enough that when 5-grams are formed sentences do not connect. Runs of special characters were converted to a single Ξ symbol to indicate one or more unusual characters in sequence, and sentence subclause markers such as parentheses and brackets were removed. Then every word token was shifted to its own line (with the repeated ϴ symbols marking sentence breaks) so that the entire file was organised into an easy-to-read list for making ngrams (collections of consecutive words).
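To make the target format concrete, here is a minimal toy sketch of the idea (a made-up two-sentence input, not the real pipeline, which is the bash-based code in Appendix 1):

#toy input: two sentences on one line
raw <- "The cat sat. The dog ran."
#lowercase, replace the full stops with four sentence placeholders, then split into tokens
long <- unlist(strsplit(gsub("\\.", " ϴ ϴ ϴ ϴ ", tolower(raw)), "[[:space:]]+"))
long
#expected: "the" "cat" "sat" "ϴ" "ϴ" "ϴ" "ϴ" "the" "dog" "ran" "ϴ" "ϴ" "ϴ" "ϴ"
#with four placeholders between sentences, no 5-gram can contain words from two
#different sentences; windows overlapping the boundary are full of ϴ and can be dropped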
Summarise all the things
For readability, answers have been dropped straight into the text as inline code rather than as interspersed code blocks.
Because of my data cleaning before reading the data in, my answers for how many lines there are (each token is processed onto a separate line), how many words, and how many characters (characters are deleted, combined, and added for sentence breaks) are, or at least should be, very different to other people's.
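If you want to reproduce the same style of counts, something along these lines on the cleaned files (locations as in Appendix 1) should do it; this is a sketch of the idea rather than my exact reporting code.

tokens <- readLines("~/en_US_cleaned/en_US.twitter.txt")
length(tokens)        #lines in the cleaned file, which is also the token count
sum(tokens != "ϴ")    #word tokens, excluding the sentence-break placeholders
sum(nchar(tokens))    #characters remaining after cleaning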
You may like to know:
Twitter contains more words and lines but a smaller range of distinct ngrams. This suggests that people on Twitter do not express things in as wide a range of ways as people in the other sources.
The average word length of Twitter is shorter than that of Blogs or News.
Blogs and News are very similar to each other.
This can be seen in a plot of how common unique ngrams (those with a frequency of one) are compared to more frequent ngrams. While the data has a very long tail, the three corpora converge by the point where six entries share the same ngram, so the x axis can be truncated.
The Twitter data is highlighted in blue: its fewer unique and almost-unique entries show a different pattern to the similar News and Blogs data.
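The numbers behind a plot like that can be pulled from the summarised ngram tables produced by the code in Appendix 2; a rough sketch for one corpus (the saved file path is the one used there; repeat for the other two to get the comparison) is:

ngramfreq <- readRDS("~/en_US_agg/en_US.twitter.txt.RDS")
#how many distinct 5-grams occur once, twice, ... keeping only the first six bins
freq_of_freq <- table(ngramfreq$freq)
barplot(head(freq_of_freq, 6),
        xlab="times a 5-gram occurs", ylab="number of distinct 5-grams")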
Off into the wild blue yonder
Once the most valuable ngrams to keep for the purposes of prediction have been determined, the heart of my web app will be lists of which table rows each word appears in: the intersection of those sets of row positions is the position of the prediction, as demonstrated in the following code:
mysample <- read.table(header=TRUE, text="
n4 n3 n2 n1 frequent
j g g e 1
h i d d 1
a c g d 1
g f f a 1
h j i e 1
h h h e 1
f j a j 3
j d f a 1
b c f i 2
c c d a 1
j a j c 2
d a a j 1")
#row positions in the sample table
position <- 1:nrow(mysample)
#for each column, a list of the rows in which each word appears
n4list <- tapply(position, mysample$n4, c)
n3list <- tapply(position, mysample$n3, c)
n2list <- tapply(position, mysample$n2, c)
predictions <- mysample$n1
#example prediction for the sequence "f j a ..."
predictions[intersect(n4list[["f"]], intersect(n3list[["j"]], n2list[["a"]]))]
My reasons for this:
Appendix 1: cleaning the raw corpora

library(lubridate)
preptext <- function(which){
setwd("~/")
#copy the corpus into the working directory
call <- paste("cp ~/en_US/",which," working1.txt",sep="")
system(call)
#lowercase everything
system("tr '[:upper:]' '[:lower:]' < working1.txt > exam.txt")
#strip NUL bytes
system("tr -d '\\000' < exam.txt > better.txt")
#replace every digit with 9 so all numbers collapse to the same pattern
system("sed -i -e 's/[123456780]/9/g' better.txt")
#remove subclause markers (parentheses, braces, brackets) and a few other symbols
system("sed -i -e 's/[%$\\(\\)}{\"<>]//g' better.txt")
system("sed -i -e 's/\\[//g' better.txt")
system("sed -i -e 's/]//g' better.txt")
#converting any Δ so it can be a placeholder
system("sed -i -e 's/Δ/Ξ/g' better.txt")
#converting any ϴ so it can be a placeholder
system("sed -i -e 's/ϴ/Ξ/g' better.txt")
#ends of sentences become 4 (ngram length - 1) ϴ placeholders
system("sed -i -e 's/[\\!?\\:]/ ϴ ϴ ϴ ϴ /g' better.txt")
#to keep st. as an abbreviation, temporarily convert it to stΔ
system("sed -i -e 's/ st\\. / stΔ /g' better.txt")
#convert . to ϴ ϴ ϴ ϴ
system("sed -i -e 's/\\./ ϴ ϴ ϴ ϴ /g' better.txt")
#convert stΔ to st.
system("sed -i -e 's/ stΔ / st. /g' better.txt")
#convert symbols I don't want to predict with to Ξ
system("sed -i -e \"s/[^[:alpha:]9#@.'[:space:]ϴ-]/Ξ/g\" better.txt")
#strip carriage returns
system("tr -d '\\r' < better.txt > best.txt")
#convert linebreaks to ϴ ϴ ϴ ϴ (tr cannot map one character to a string,
#so append the placeholders with sed, then join the lines with tr)
system("sed -i -e 's/$/ ϴ ϴ ϴ ϴ /' best.txt")
system("tr '\\n' ' ' < best.txt > final.txt")
#collapse runs of Ξ into a single Ξ
system("sed -i -e 's/ΞΞ*/Ξ/g' final.txt")
#collapse runs of whitespace into a single space
system("sed -i -e 's/[[:space:]][[:space:]]*/ /g' final.txt")
#put every token on its own line by turning spaces into newlines
system("sed -i -e 's/ /\\
/g' final.txt")
call <- paste("cp final.txt ~/en_US_cleaned/",which,sep="")
system(call)
}
#time the full cleaning run across the three corpora
checkpoint1 <- now()
preptext("en_US.blogs.txt")
preptext("en_US.news.txt")
preptext("en_US.twitter.txt")
checkpoint2 <- now()
checkpoint2 - checkpoint1
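As a quick sanity check, not part of the pipeline itself, the start of a cleaned file (using the output location above) can be eyeballed to confirm the one-token-per-line layout:

#peek at the first 20 tokens of the cleaned Twitter file
readLines("~/en_US_cleaned/en_US.twitter.txt", n = 20)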
Appendix 2: generating the summary 5-grams

library(data.table)
library(dplyr)
library(lubridate)
sumgrams <- function(myfile){
#read the cleaned one-token-per-line file
tokens <- readLines(paste("~/en_US_cleaned/",myfile,sep=""))
#turn the sentence-break placeholders into NA
tokens[tokens == "ϴ"] <- NA
checkpoint1 <- now()
#pad the front so the number of tokens is a multiple of 5
tokens <- c(character(5 - length(tokens) %% 5), tokens)
#positions of every 5th token; after padding, eoseq is the same as steps
steps <- seq(from=5, to=length(tokens), by=5)
eoseq <- steps + (length(tokens) - max(steps))
#build the 5-grams in five phase-shifted blocks: p is the final (predicted) word,
#n2 the word before it, and so on back to n5; the c(NA, ...) pieces shift a block
#back by one full step, with NA where a 5-gram would start before the file begins
p = c(tokens[eoseq], tokens[eoseq-1], tokens[eoseq-2], tokens[eoseq-3], tokens[eoseq-4])
n2 = c(tokens[eoseq-1], tokens[eoseq-2], tokens[eoseq-3], tokens[eoseq-4], (c(NA, tokens[eoseq]))[1:length(steps)])
n3 = c(tokens[eoseq-2], tokens[eoseq-3], tokens[eoseq-4], (c(NA, tokens[eoseq]))[1:length(steps)], (c(NA, tokens[eoseq-1]))[1:length(steps)])
n4 = c(tokens[eoseq-3], tokens[eoseq-4], (c(NA, tokens[eoseq]))[1:length(steps)], (c(NA, tokens[eoseq-1]))[1:length(steps)], (c(NA, tokens[eoseq-2]))[1:length(steps)])
n5 = c(tokens[eoseq-4], (c(NA, tokens[eoseq]))[1:length(steps)], (c(NA, tokens[eoseq-1]))[1:length(steps)], (c(NA, tokens[eoseq-2]))[1:length(steps)], (c(NA, tokens[eoseq-3]))[1:length(steps)])
#tabulate how often each distinct 5-gram occurs
dt <- data.table(n5,n4,n3,n2,p)
ngramfreq <- dt %>% group_by(p, n2, n3, n4, n5) %>% summarise(freq = n())
saveRDS(ngramfreq, file=paste("~/en_US_agg/",myfile,".RDS",sep=""), compress=FALSE)
checkpoint2 <- now()
print(checkpoint2 - checkpoint1)
}
sumgrams("en_US.twitter.txt")
sumgrams("en_US.blogs.txt")
sumgrams("en_US.news.txt")
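Finally, to back up the earlier claim that the 4-grams through 1-grams fall easily out of the summary 5-grams, here is a rough sketch (using the Twitter file saved above): drop the outermost word column and re-aggregate the frequencies; repeating the step gives each lower order in turn.

library(dplyr)
fivegrams <- readRDS("~/en_US_agg/en_US.twitter.txt.RDS")
#drop n5 (the oldest word) and sum the frequencies to get the 4-gram counts
fourgrams <- fivegrams %>% group_by(p, n2, n3, n4) %>% summarise(freq = sum(freq))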