Data Science Capstone Milestone Report

Overview

The goal of this milestone report is to show that I am comfortable working with the data and on track to build the prediction algorithm.

Data Import and Formatting

The data provided came in four languages:

German
English (US)
Finnish
Russian

In each language, there were three sources of data:

Twitter
News articles
Blogs

These are fairly large files, as the sizes and line counts computed below show:

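library(data.table) # paths is used as a data.table below
library(stringr)    # str_c, str_split, str_trim
library(knitr)      # kable

# Get file size, in MB (1 MB = 1,000,000 bytes); file.info is vectorized
paths$size.MB <- file.info(paths$path)$size / 1e6

# Get number of lines in each file
getLineCount <- function(path)
{
  # Call the UNIX wc command; the output looks like "899288 ./final/..."
  sys.str <- system(str_c('wc -l ', shQuote(path)), intern=TRUE)
  # Trim padding and keep the leading count (some systems left-pad the number)
  lineCount <- as.numeric(str_split(str_trim(sys.str), '\\s+')[[1]][1])
  return(lineCount)
}
paths$lines <- sapply(paths$path, getLineCount, USE.NAMES=FALSE)

# Print out table
kable(paths)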

lang	source	path	size.MB	lines
en_US	blogs	./final/en_US/en_US.blogs.txt	210.16	899288
en_US	news	./final/en_US/en_US.news.txt	205.8119	1010242
en_US	twitter	./final/en_US/en_US.twitter.txt	167.1053	2360148
de_DE	blogs	./final/de_DE/de_DE.blogs.txt	85.45967	371440
de_DE	news	./final/de_DE/de_DE.news.txt	95.59196	244743
de_DE	twitter	./final/de_DE/de_DE.twitter.txt	75.57834	947774
fi_FI	blogs	./final/fi_FI/fi_FI.blogs.txt	108.5036	439785
fi_FI	news	./final/fi_FI/fi_FI.news.txt	94.23435	485758
fi_FI	twitter	./final/fi_FI/fi_FI.twitter.txt	25.33114	285214
ru_RU	blogs	./final/ru_RU/ru_RU.blogs.txt	116.8558	337100
ru_RU	news	./final/ru_RU/ru_RU.news.txt	118.9964	196360
ru_RU	twitter	./final/ru_RU/ru_RU.twitter.txt	105.1823	881414
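
The `wc -l` call above assumes a UNIX-like system. Where that is not available, the same counts can be obtained in base R. The following is only a sketch: `count_lines` and its `chunk_size` parameter are hypothetical names, not part of the original analysis, and the temporary file stands in for the real data files.

```r
# Count lines by reading the file in fixed-size chunks (base R only).
# count_lines() is a hypothetical helper introduced for illustration.
count_lines <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))  # always release the connection
  total <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0) break  # end of file reached
    total <- total + length(chunk)
  }
  total
}

# Usage on a small temporary file standing in for a corpus file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
n_lines <- count_lines(tmp)
size_mb <- file.info(tmp)$size / 1e6
```

Reading in chunks keeps memory use bounded for files of this size. Note that `wc -l` counts newline characters, so the two approaches can differ by one when a file lacks a trailing newline.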

To keep the exploration fast, we read the first 10,000 lines of each data set and focus on the en_US language. Note that this takes the head of each file rather than a random sample.

# Read the first N lines of each en_US file into one table.
# Passing the path directly lets readLines open and close the connection itself.
N <- 10000
readSource <- function(src)
{
  path <- paths[lang=="en_US" & source==src]$path
  data.table(source=as.factor(src), raw=readLines(path, n=N))
}
lines <- rbindlist(lapply(c('blogs', 'news', 'twitter'), readSource))

# Format and clean data
lines$formatted <- tolower(lines$raw)
lines$formatted <- removePunctuation(lines$formatted)
lines$formatted <- removeNumbers(lines$formatted)

# Show example of formatted data
kable(lines[sample(nrow(lines), 5)])

source	raw	formatted
twitter	Darn! I took UK plus 50.	darn i took uk plus
twitter	“The minute I’m out of town / My friends get sick, go back on the sauce / Engage in unhappy love affairs (Philip Whalen)	the minute im out of town my friends get sick go back on the sauce engage in unhappy love affairs philip whalen
blogs	In itself, the tale of the publication of Into the Cannibal’s Pot: Lessons For America From Post-Apartheid South Africa bears telling. For while this polemic respects no political totems or taboos, it is faithful to facts. These facts cried out to be chronicled. They should not have had a struggle to find their way into print.	in itself the tale of the publication of into the cannibals pot lessons for america from postapartheid south africa bears telling for while this polemic respects no political totems or taboos it is faithful to facts these facts cried out to be chronicled they should not have had a struggle to find their way into print
news	A: At 38 degrees below zero!	a at degrees below zero
twitter	Shite. Damn me & my cash flow issues.	shite damn me my cash flow issues
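
The import step above takes the first N lines of each file. If a random sample is preferred, it could be drawn along the following lines. This is a minimal sketch, assuming the file fits in memory; `sample_lines` and its `seed` argument are hypothetical names, and the temporary file stands in for a real corpus file.

```r
# Draw a reproducible random sample of lines from a text file (base R only).
# sample_lines() is a hypothetical helper introduced for illustration.
sample_lines <- function(path, n, seed = 1234) {
  all_lines <- readLines(path, warn = FALSE, skipNul = TRUE)
  set.seed(seed)                                   # make the draw repeatable
  n <- min(n, length(all_lines))                   # never ask for more than exists
  keep <- sort(sample.int(length(all_lines), n))   # keep original file order
  all_lines[keep]
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:100), tmp)
picked <- sample_lines(tmp, n = 10)
```

Fixing the seed means the same lines are selected on every run, which keeps the frequency tables and plots reproducible.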

Exploratory Data Analysis

Now let’s take a look at some of the characteristics of these data. We can look at the frequency of individual words in the dataset by using termFreq in the tm package.

# Some words are more frequent than others - what are the distributions of word 
# frequencies?
mft <- as.data.table(termFreq(lines$formatted))
mft <- mft[order(mft$N, decreasing=TRUE)]
colnames(mft) <- c('term','N')
mft$term <- factor(mft$term, levels = mft$term[order(-mft$N)])
ggplot(mft[1:50], aes(x=term, y=N, fill=N)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title='Most Frequent Terms')
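
One way to summarize the shape of the word-frequency distribution is to ask how many distinct words are needed to cover a given share of all word occurrences. The sketch below uses base R only; the toy `texts` vector and the `words_to_cover` helper are assumptions for illustration and do not come from the data above.

```r
# Toy input standing in for lines$formatted (an assumption for illustration)
texts <- c("the cat sat on the mat", "the dog sat", "a cat and a dog")

# Split on whitespace and count each word
words <- unlist(strsplit(texts, "\\s+"))
words <- words[nzchar(words)]
freq <- sort(table(words), decreasing = TRUE)

# Number of most-frequent words needed to reach a target share of occurrences.
# words_to_cover() is a hypothetical helper.
words_to_cover <- function(freq, target) {
  coverage <- cumsum(as.numeric(freq)) / sum(freq)
  which(coverage >= target)[1]
}

words_for_half <- words_to_cover(freq, 0.5)
words_for_all  <- words_to_cover(freq, 1.0)
```

On a real corpus the count for 50% coverage is typically a small fraction of the vocabulary, which is what makes a compact prediction model feasible.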

The tm package allows us to build a corpus structure for text analysis. Using tokenizers on the corpus, we can explore two- and three-word phrases by constructing bi- and tri-gram tokenizers.

# Build tokenizers
twoGramTokenizer <- function(x, n)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
threeGramTokenizer <- function(x, n)
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

# Build corpus and term document matrices, remove sparse terms
corpus <- VCorpus(VectorSource(lines$formatted))
bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = twoGramTokenizer))
bigrams  <- removeSparseTerms(bigrams, 0.999)
trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = threeGramTokenizer))
trigrams <- removeSparseTerms(trigrams, 0.999)

# Build data tables with most frequent terms
bigrams.mft <- rowSums(as.matrix(bigrams))
bigrams.mft <- data.table(gram=names(bigrams.mft),
                                 freq=bigrams.mft)[order(-freq)]
trigrams.mft <- rowSums(as.matrix(trigrams))
trigrams.mft <- data.table(gram=names(trigrams.mft),
                                 freq=trigrams.mft)[order(-freq)]

ggplot(bigrams.mft[1:50], aes(x=reorder(gram, -freq), y=freq, fill=freq)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title='Most Frequent Bigrams')

wordcloud(bigrams.mft$gram, bigrams.mft$freq, max.words=100, random.order=FALSE)

ggplot(trigrams.mft[1:50], aes(x=reorder(gram, -freq), y=freq, fill=freq)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title='Most Frequent Trigrams')

wordcloud(trigrams.mft$gram, trigrams.mft$freq, max.words=100, random.order=FALSE)
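
To make the tokenizers above concrete, here is a base-R sketch of what an n-gram tokenizer produces for a single line of text. `make_ngrams` is a hypothetical helper written for illustration; it is not the NLP package's `ngrams` function used above, although it should yield the same phrases for whitespace-separated input.

```r
# Build all n-word phrases from one string (base R only).
# make_ngrams() is a hypothetical helper introduced for illustration.
make_ngrams <- function(text, n) {
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))  # too short for any n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# One of the formatted example lines shown earlier
bi  <- make_ngrams("darn i took uk plus", 2)
tri <- make_ngrams("darn i took uk plus", 3)

# Counting n-grams across several lines
docs <- c("i love you", "i love cake")
bigram_counts <- sort(table(unlist(lapply(docs, make_ngrams, n = 2))),
                      decreasing = TRUE)
```

A line with k words yields k - n + 1 n-grams, so short tweets contribute few trigrams while long blog posts contribute many.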

Conclusion

The next step for this project is to build a predictive Shiny app that accepts a word from the user and predicts the next word they will use. I expect to base the model on the bigram frequencies shown above. The main challenge will be keeping the app responsive: it must return a prediction quickly rather than scanning the entire corpus each time a user enters input.
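
As a rough illustration of the planned approach, the sketch below precomputes a bigram count table once and then answers each query with a simple lookup, so nothing has to be re-parsed at prediction time. It is only a sketch under stated assumptions: `build_bigram_model`, `predict_next`, and the toy training sentences are all hypothetical and are not the final model.

```r
# Build a table of (first word, second word, count), sorted so that the most
# frequent continuation of each word comes first. Hypothetical helper.
build_bigram_model <- function(texts) {
  pairs <- lapply(texts, function(txt) {
    tokens <- strsplit(txt, "\\s+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < 2) return(NULL)  # no bigrams in a one-word line
    data.frame(first = head(tokens, -1), second = tail(tokens, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1), FUN = sum)
  counts[order(counts$first, -counts$n), ]
}

# Return up to k most frequent next words for a given word. Hypothetical helper.
predict_next <- function(model, word, k = 3) {
  hits <- model[model$first == tolower(word), ]
  head(hits$second, k)
}

# Toy training data (an assumption for illustration)
model <- build_bigram_model(c("i love you", "i love cake",
                              "i like you", "you love me"))
top_for_i   <- predict_next(model, "I", k = 2)
top_unknown <- predict_next(model, "zebra")
```

Words never seen as the first half of a bigram return an empty result here; a fuller model would need a fallback, such as suggesting the most frequent words overall.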

Jonathan Kunze

10/27/2017
