The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm.
The data provided came in four languages: English (en_US), German (de_DE), Finnish (fi_FI), and Russian (ru_RU).
In each language, there were three sources of data: blogs, news, and Twitter.
These are pretty large files:
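# Load the packages used throughout this report
library(data.table); library(stringr); library(knitr)
library(tm); library(ggplot2); library(wordcloud)
# Get file size, in MB
paths$size.MB <- sapply(paths$path, function(x) file.info(x)$size / 1e6, USE.NAMES = FALSE)
# Get number of lines in each file
getLineCount <- function(path)
{
  # Call UNIX wc command; quote the path in case it contains spaces
  sys.str <- system(str_c('wc -l ', shQuote(path)), intern=TRUE)
  # wc prints "<count> <path>", so keep the first run of digits
  lineCount <- as.numeric(str_extract(sys.str, '[0-9]+'))
  return(lineCount)
}
paths$lines <- sapply(paths$path, getLineCount, USE.NAMES = FALSE)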
# Print out table
kable(paths)
| lang | source | path | size.MB | lines |
|---|---|---|---|---|
| en_US | blogs | ./final/en_US/en_US.blogs.txt | 210.16 | 899288 |
| en_US | news | ./final/en_US/en_US.news.txt | 205.8119 | 1010242 |
| en_US | twitter | ./final/en_US/en_US.twitter.txt | 167.1053 | 2360148 |
| de_DE | blogs | ./final/de_DE/de_DE.blogs.txt | 85.45967 | 371440 |
| de_DE | news | ./final/de_DE/de_DE.news.txt | 95.59196 | 244743 |
| de_DE | twitter | ./final/de_DE/de_DE.twitter.txt | 75.57834 | 947774 |
| fi_FI | blogs | ./final/fi_FI/fi_FI.blogs.txt | 108.5036 | 439785 |
| fi_FI | news | ./final/fi_FI/fi_FI.news.txt | 94.23435 | 485758 |
| fi_FI | twitter | ./final/fi_FI/fi_FI.twitter.txt | 25.33114 | 285214 |
| ru_RU | blogs | ./final/ru_RU/ru_RU.blogs.txt | 116.8558 | 337100 |
| ru_RU | news | ./final/ru_RU/ru_RU.news.txt | 118.9964 | 196360 |
| ru_RU | twitter | ./final/ru_RU/ru_RU.twitter.txt | 105.1823 | 881414 |
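The `paths` table used above is not defined in this section. As a rough sketch, it could be rebuilt from the four language codes and three sources, assuming the `./final/<lang>/<lang>.<source>.txt` layout visible in the table. The helper `countLines` is a hypothetical base R alternative to calling `wc -l`, for systems where that command is not available; it uses a plain data frame rather than a data.table to stay dependency-free.

```r
# Hypothetical reconstruction of the `paths` table; the directory layout is
# taken from the paths printed in the table above.
langs   <- c("en_US", "de_DE", "fi_FI", "ru_RU")
sources <- c("blogs", "news", "twitter")

# One row per (language, source) pair; expand.grid varies `source` fastest
paths <- expand.grid(source = sources, lang = langs, stringsAsFactors = FALSE)
paths <- paths[, c("lang", "source")]
paths$path <- file.path(".", "final", paths$lang,
                        paste0(paths$lang, ".", paths$source, ".txt"))

# Portable line count: read in chunks so the whole file is never held in
# memory at once. `countLines` is an illustrative name, not from the report.
countLines <- function(path, chunkSize = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  repeat {
    n <- length(readLines(con, n = chunkSize, warn = FALSE))
    if (n == 0) break
    total <- total + n
  }
  total
}
```

Note that `wc -l` counts newline characters, so the two approaches can differ by one when a file's last line has no trailing newline.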
For the sake of just exploring the data, we will focus on the en_US language and read the first 10,000 lines of each data set. This is a convenience subset taken from the top of each file, not a random sample.
# Read in data to table
N <- 10000
lines <- data.table()
for (src in c('blogs', 'news', 'twitter')) {
  # Passing the path (not an open connection) lets readLines close the file itself
  path <- paths[lang == "en_US" & source == src]$path
  lines <- rbind(lines, data.table(source = as.factor(src),
                                   raw = readLines(path, n = N)))
}
# Format and clean data
lines$formatted <- tolower(lines$raw)
lines$formatted <- removePunctuation(lines$formatted)
lines$formatted <- removeNumbers(lines$formatted)
# Show example of formatted data
kable(lines[sample(nrow(lines), 5)])
| source | raw | formatted |
|---|---|---|
| twitter | Darn! I took UK plus 50. | darn i took uk plus |
| twitter | “The minute I’m out of town / My friends get sick, go back on the sauce / Engage in unhappy love affairs (Philip Whalen) | the minute im out of town my friends get sick go back on the sauce engage in unhappy love affairs philip whalen |
| blogs | In itself, the tale of the publication of Into the Cannibal’s Pot: Lessons For America From Post-Apartheid South Africa bears telling. For while this polemic respects no political totems or taboos, it is faithful to facts. These facts cried out to be chronicled. They should not have had a struggle to find their way into print. | in itself the tale of the publication of into the cannibals pot lessons for america from postapartheid south africa bears telling for while this polemic respects no political totems or taboos it is faithful to facts these facts cried out to be chronicled they should not have had a struggle to find their way into print |
| news | A: At 38 degrees below zero! | a at degrees below zero |
| twitter | Shite. Damn me & my cash flow issues. | shite damn me my cash flow issues |
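The code above takes the first N lines of each file. If a random sample is preferred, one possible approach is sketched below. `sampleLines` is a hypothetical helper, not part of the original analysis, and it assumes the file fits in memory, which should hold for files of a few hundred MB on most machines.

```r
# Sketch: draw a reproducible random sample of lines from a text file.
# `sampleLines` and its `seed` argument are illustrative assumptions.
sampleLines <- function(path, n, seed = 42) {
  allLines <- readLines(path, warn = FALSE, skipNul = TRUE)
  set.seed(seed)                                  # make the draw repeatable
  size <- min(n, length(allLines))                # never ask for more than exist
  allLines[sort(sample.int(length(allLines), size))]  # keep file order
}

# Usage on a small temporary file standing in for a corpus file
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:100), tmp)
picked <- sampleLines(tmp, 10)
length(picked)
```

Sorting the sampled indices keeps the lines in their original file order, which makes the sample easier to compare against the source.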
Now let’s take a look at some of the characteristics of these data. We can look at the frequency of individual words in the dataset by using termFreq in the tm package.
# Some words are more frequent than others - what are the distributions of word
# frequencies?
mft <- as.data.table(termFreq(lines$formatted))
mft <- mft[order(mft$N, decreasing=TRUE)]
colnames(mft) <- c('term','N')
mft$term <- factor(mft$term, levels = mft$term[order(-mft$N)])
ggplot(mft[1:50], aes(x=term, y=N, fill=N)) + geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title='Most Frequent Terms')
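To make the counting step concrete, here is a minimal base R sketch of a term-frequency table on toy input. It is not how `termFreq` is implemented. One difference worth knowing: by default `termFreq` discards words shorter than three characters (its `wordLengths` control defaults to `c(3, Inf)`), whereas this sketch keeps them.

```r
# Toy input standing in for lines$formatted (lowercased, punctuation removed)
docs <- c("the cat sat on the mat", "the dog sat")

# Split on whitespace, drop empty tokens, and tabulate the rest
tokens <- unlist(strsplit(docs, "\\s+"))
tokens <- tokens[nzchar(tokens)]
freq <- sort(table(tokens), decreasing = TRUE)

# Most frequent terms first
head(freq, 3)
```

Because of that default, very short words such as "a", "of", and "to" would not appear in a plot built from `termFreq` output unless `wordLengths` is changed.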
The tm package allows us to build a corpus structure for text analysis. Using tokenizers on the corpus, we can explore two- and three-word phrases by constructing bi- and tri-gram tokenizers.
# Build tokenizers
twoGramTokenizer <- function(x, n)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
threeGramTokenizer <- function(x, n)
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
# Build corpus and term-document matrices; drop terms absent from more than 99.9% of documents
corpus <- VCorpus(VectorSource(lines$formatted))
bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = twoGramTokenizer))
bigrams <- removeSparseTerms(bigrams, 0.999)
trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = threeGramTokenizer))
trigrams <- removeSparseTerms(trigrams, 0.999)
# Build data tables with most frequent terms
bigrams.mft <- rowSums(as.matrix(bigrams))
bigrams.mft <- data.table(gram=names(bigrams.mft),
freq=bigrams.mft)[order(-freq)]
trigrams.mft <- rowSums(as.matrix(trigrams))
trigrams.mft <- data.table(gram=names(trigrams.mft),
freq=trigrams.mft)[order(-freq)]
ggplot(bigrams.mft[1:50], aes(x=reorder(gram, -freq), y=freq, fill=freq)) + geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title='Most Frequent Bigrams')
wordcloud(bigrams.mft$gram,bigrams.mft$freq,max.words=100,random.order = F)
ggplot(trigrams.mft[1:50], aes(x=reorder(gram, -freq), y=freq, fill=freq)) + geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title='Most Frequent Trigrams')
wordcloud(trigrams.mft$gram,trigrams.mft$freq,max.words=100,random.order = F)
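For readers unfamiliar with n-grams, the following base R sketch shows what a bi- or tri-gram tokenizer produces for a single line of text. `makeNgrams` is a hypothetical helper written for illustration; the analysis above relies on `ngrams` and `words` from the NLP package instead.

```r
# Hypothetical helper: slide a window of n words across one line of text
makeNgrams <- function(text, n) {
  w <- strsplit(text, "\\s+")[[1]]
  w <- w[nzchar(w)]                       # drop empty tokens
  if (length(w) < n) return(character(0)) # too short to form an n-gram
  starts <- seq_len(length(w) - n + 1)
  vapply(starts,
         function(i) paste(w[i:(i + n - 1)], collapse = " "),
         character(1))
}

makeNgrams("thanks for the follow", 2)
# "thanks for" "for the" "the follow"
makeNgrams("thanks for the follow", 3)
# "thanks for the" "for the follow"
```

A line with k words yields k - n + 1 n-grams, so very short lines (common in the Twitter data) contribute few or no trigrams.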
The next steps for this project are to create a predictive Shiny app that will accept a user’s word and predict the next word they will use. For the model, I expect to start from the bigrams shown above. The challenge will be to make the model responsive enough that it can search the entire corpus quickly when a user enters input.
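One way to keep prediction fast is to precompute, for each word, its most frequent follower, so that a prediction is a single lookup rather than a pass over the corpus. The sketch below illustrates the idea on made-up counts. The names `bigramCounts`, `nextWord`, and `predictNext` are assumptions for illustration, and the final model may well differ.

```r
# Toy bigram counts standing in for bigrams.mft (the values are made up)
bigramCounts <- data.frame(
  gram = c("of the", "of a", "in the", "thanks for", "of course"),
  freq = c(50, 20, 40, 10, 5),
  stringsAsFactors = FALSE
)

# Split each bigram into its first and second word
parts <- strsplit(bigramCounts$gram, " ", fixed = TRUE)
bigramCounts$first  <- vapply(parts, `[`, character(1), 1)
bigramCounts$second <- vapply(parts, `[`, character(1), 2)

# Precompute the most frequent follower of each first word, once, up front
ordered  <- bigramCounts[order(bigramCounts$first, -bigramCounts$freq), ]
best     <- ordered[!duplicated(ordered$first), ]
nextWord <- setNames(best$second, best$first)

# Hypothetical predictor: a named-vector lookup that returns NA for unseen words
predictNext <- function(word) unname(nextWord[tolower(word)])

predictNext("of")     # "the"
predictNext("zebra")  # NA
```

A lookup table like this cannot predict anything for a word it has never seen, so a practical model would likely need a fallback, such as suggesting the most frequent word overall.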