Summary

The overall goal of this project is to develop a text prediction algorithm based on the analysis of a large corpus of text documents.

The purpose of this milestone report is to show how I have processed and organized the data. I have been given three text files: blogs, news and twitter. In this report I show how I have handled the text, present some exploratory analysis, and discuss how I will handle the rest of the project.

About the Data

The data originally comes from HC Corpora (http://www.corpora.heliohost.org/). The data sets for this project come from a collection of written text in different languages (Russian, Finnish, German and English), made up of news articles, blog posts and tweets. This milestone report is built on the English data set.

Loading and Processing the Data

Preparing

I first load all the libraries I am using or plan to use. I also define a function that expands the contractions within each document. The process is not perfect: if a word is possessive, the function still replaces the “’s” with “ is”. I believe this will have a negligible effect on the end result.

setwd("~/Capstone/en_US")
library(slam); library(stringr); library(tm); library(SnowballC); library(data.table); library(Matrix); library(textreg); library(wordcloud); library(ggplot2); library(stylo); library(gridExtra)

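# Expand common English contractions. The specific rules run first so the
# generic "n't" and "'s" rules do not mangle them.
processContractions <- function(vector){
      
      vector <- gsub("can't", "cannot", vector, perl=TRUE)
      vector <- gsub("I'm", "I am", vector, perl=TRUE)
      vector <- gsub("'ll", " will", vector, perl=TRUE)   # leading space keeps "I'll" from becoming "Iwill"
      vector <- gsub("'ve", " have", vector, perl=TRUE)
      vector <- gsub("'d", " had", vector, perl=TRUE)     # "would" contractions are also rewritten as "had"
      vector <- gsub("n't", " not", vector, perl=TRUE)
      vector <- gsub("'s", " is", vector, perl=TRUE)      # possessives are also rewritten as " is"
      return(vector)
}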
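To make the possessive caveat concrete, here is a minimal sketch using a trimmed-down copy of the rules above. The name expand_contractions and the example sentences are my own illustrations and are not part of the project code.

```r
# Hypothetical, trimmed-down copy of the contraction rules (illustration only).
expand_contractions <- function(x) {
  x <- gsub("can't", "cannot", x, fixed = TRUE)  # specific rule first
  x <- gsub("n't", " not", x, fixed = TRUE)      # generic negation
  x <- gsub("'s", " is", x, fixed = TRUE)        # also hits possessives
  x
}

examples <- c("it's late", "I can't go", "the dog's bone")
expanded <- expand_contractions(examples)
print(expanded)
```

The third sentence shows the known weakness: the possessive "dog's" is expanded as if it were a contraction.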

Loading the English Data

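For this part of the process I loaded roughly the first 10% of the lines of each data file (the full line counts are in the comments), dropped any non-ASCII characters, and expanded the contractions.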

#bloglines <- NROW(readLines("en_US.blogs.txt")) #899,288
blogs <- readLines("en_US.blogs.txt", n = 90000, skipNul = TRUE)
blogs <- iconv(blogs, to="ASCII", sub = "")
blogs <- processContractions(blogs)

#newslines <- NROW(readLines("en_US.news.txt")) #77,259
news <- readLines("en_US.news.txt", n=8000, skipNul = TRUE) 
news <- iconv(news, to="ASCII", sub = "")
news <- processContractions(news)

#twitterlines <- NROW(readLines("en_US.twitter.txt")) #2,360,148
twitter <- readLines("en_US.twitter.txt", n=240000, skipNul = TRUE) 
twitter <- iconv(twitter, to="ASCII", sub = "")
twitter <- processContractions(twitter)
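The code above takes the first n lines of each file. One possible alternative, which I have not used here, is to draw a random 10% of the lines so that the sample does not depend on the order of the file. The sketch below assumes a small temporary stand-in file and an arbitrary seed; neither comes from the project data.

```r
# Sketch: random 10% sample of lines, using a temporary stand-in file.
set.seed(1234)                                   # arbitrary seed, for reproducibility
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), tmp)      # stand-in for a corpus file

all_lines <- readLines(tmp, skipNul = TRUE)
keep <- sample(length(all_lines), size = round(0.10 * length(all_lines)))
sampled <- all_lines[sort(keep)]                 # keep the original line order

length(sampled)
unlink(tmp)                                      # remove the temporary file
```

This reads the whole file before sampling, so it trades memory for a sample that covers the entire file.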

In this section, I combine the loaded lines and convert them to a corpus for use with the tm package. The tm transformations remove punctuation and numbers, convert all the text to lower case, remove the words in a loaded profanity file, strip the leftover white space, and ensure each document remains a plain text document. The profanity list is a list of words banned by Google (http://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/). It contains 550 “words” that have been collected and are banned in Google forums.

combo <- c(blogs, twitter, news)
combo <- VCorpus(VectorSource(combo))

combo <- tm_map(combo, removePunctuation)
combo <- tm_map(combo, removeNumbers)
combo <- tm_map(combo, content_transformer(tolower))

badwords <- readLines("fullbadwordlist.txt")

combo <- tm_map(combo, removeWords, badwords)
combo <- tm_map(combo, stripWhitespace)
combo <- tm_map(combo, PlainTextDocument)
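To show what these steps do to a single line of text, here is a rough base R approximation of the same sequence. It is only a sketch: the function clean_line, the sample sentence, and the stand-in banned word "darn" are my own assumptions, and the regular expressions imitate the tm transformations rather than reproduce them exactly.

```r
# Sketch: approximate the cleaning pipeline on one string with base R.
clean_line <- function(x, banned) {
  x <- gsub("[[:punct:]]+", "", x)               # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)               # remove numbers
  x <- tolower(x)                                # lower case
  pattern <- sprintf("\\b(%s)\\b", paste(banned, collapse = "|"))
  x <- gsub(pattern, "", x, perl = TRUE)         # remove banned words
  x <- gsub("[[:space:]]+", " ", x)              # collapse leftover white space
  x
}

cleaned <- clean_line("Hello, World!  It is 2015 and DARN cold.",
                      banned = c("darn"))        # "darn" is a stand-in word
print(cleaned)
```

Lower-casing before removing the banned words matters, because the word list is matched case-sensitively.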

This section builds a document-term matrix of all the words/terms in the processed corpus. I then subset that matrix to the terms that were used more than 20 times in total.

combo_dtm <- DocumentTermMatrix(combo)

colTotals <- as.matrix(col_sums(combo_dtm))
combo_dtm_over20 <- combo_dtm[,which(colTotals > 20)]
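The same filtering idea can be seen on a tiny dense matrix. This is a sketch with made-up counts, using base colSums in place of slam's col_sums, which serves the same purpose for the sparse matrix above.

```r
# Sketch: keep only the terms whose total count across documents exceeds 20.
counts <- matrix(c(15, 0, 3,
                   10, 1, 0,
                    2, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("the", "zebra", "cat")))  # made-up counts

col_totals <- colSums(counts)                    # total occurrences per term
frequent <- counts[, col_totals > 20, drop = FALSE]
print(frequent)
```

Using drop = FALSE keeps the result a matrix even when only one term survives the filter.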

This section splits the text into individual words using the stylo package. The n-grams (sequences of consecutive words) are created with stylo's make.ngrams function. I created a data frame of frequencies for single words and for sequences of two, three, and four words. The comments in the code give the number of n-grams of each size, still using just 10% of the given data.

for_ngrams <- txt.to.words(c(blogs, twitter, news))
# TODO: remove the bad words here too, with a helper function
# similar to processContractions(), for example:
# for_ngrams <- removeBadWords(for_ngrams)

ngrams_one <- make.ngrams(for_ngrams, ngram.size = 1)
ngrams_two <- make.ngrams(for_ngrams, ngram.size = 2)
ngrams_three <- make.ngrams(for_ngrams, ngram.size = 3)
ngrams_four <- make.ngrams(for_ngrams, ngram.size = 4)

df_one <- as.data.frame(table(ngrams_one))     #136,980
df_two <- as.data.frame(table(ngrams_two))     #2,033,618
df_three <- as.data.frame(table(ngrams_three)) #4,884,393
df_four <- as.data.frame(table(ngrams_four))   #6,452,701

df_one <- df_one[order(df_one$Freq, decreasing = TRUE),]
df_two <- df_two[order(df_two$Freq, decreasing = TRUE),]
df_three <- df_three[order(df_three$Freq, decreasing = TRUE),]
df_four <- df_four[order(df_four$Freq, decreasing = TRUE),]

sub_df_one <- subset(df_one, Freq > 5)         #31,790
sub_df_two <- subset(df_two, Freq > 5)         #129,806
sub_df_three <- subset(df_three, Freq > 5)     #87,916
sub_df_four <- subset(df_four, Freq > 5)       #25,352
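For readers without stylo installed, the n-gram step can be illustrated in base R. The helper make_ngrams below is a hypothetical stand-in that I wrote for this example, not stylo's implementation, and the word vector is made up.

```r
# Sketch: build and count bigrams from a short word vector with base R.
words <- c("i", "am", "here", "and", "i", "am", "happy")  # made-up input

# Hypothetical helper: slide a window of n words over the vector.
make_ngrams <- function(w, n) {
  if (length(w) < n) return(character(0))
  starts <- seq_len(length(w) - n + 1)
  vapply(starts,
         function(i) paste(w[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- make_ngrams(words, 2)
bigram_counts <- as.data.frame(table(bigrams), stringsAsFactors = FALSE)
bigram_counts <- bigram_counts[order(bigram_counts$Freq, decreasing = TRUE), ]
print(bigram_counts)
```

A vector of k words yields k - n + 1 n-grams, which is why the number of distinct n-grams grows so quickly with n in the counts above.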

The following graphs show the top 20 words and phrases among the 1-grams, 2-grams, and 3-grams.

Milestone Report

Camille P

July 20, 2015
