For this capstone project, we must develop a mechanism for predicting the next word a user will type, based on a collection of text files drawn from blogs, news sources, and Twitter.
In this milestone report, I will share my progress so far in analyzing the text files provided for this project.
First I will set the working directory and load the packages needed for this project.
setwd("C:/Users/19732/Downloads")
library(tokenizers)
library(dplyr)
library(ggplot2)
# Here I want to acknowledge Lincoln Mullen, the author of the 'tokenizers' package.
Using the full files provided to us would significantly increase computing time, and loading every row is likely not needed to reach the necessary degree of statistical accuracy. For this project I’m going to use 15,000 entries each from the blogs, news, and Twitter data sets.
From each set, I’ll use 10,000 entries for the training set, 2,500 for the validation set, and 2,500 for the test set. I do not use the validation set or the test set in this milestone report, but I anticipate using them in the future.
Now we load the data.
blogsData <- read.delim("en_US.blogs.txt", header=FALSE, nrows=15000)
newsData <- read.delim("en_US.news.txt", header=FALSE, nrows=15000)
twitterData <- read.delim("en_US.twitter.txt", header=FALSE, nrows=15000)
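One caveat, offered as an observation rather than a change to the code above: read.delim() treats quotation marks as text delimiters, so lines containing them can be silently merged or skipped. Reading the raw lines avoids this; a minimal sketch for the blogs file (blogsDataAlt is an illustrative name, and the same pattern would apply to the other two files):
# Read the first 15,000 raw lines so embedded quotes and tabs are kept as ordinary text.
blogsLinesRaw <- readLines("en_US.blogs.txt", n=15000, encoding="UTF-8", skipNul=TRUE)
blogsDataAlt <- data.frame(V1=blogsLinesRaw, stringsAsFactors=FALSE)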
Next, we assign lines of text to the training, validation, and test sets.
set.seed(630)
rows <- sample(nrow(blogsData))
blogsData <- data.frame(blogsData[rows, ])
blogsData$index <- c(1:nrow(blogsData))
blogsTraining <- blogsData %>% subset(index <= 10000)
blogsValidate <- blogsData %>% subset(index > 10000 & index <= 12500)
blogsTesting <- blogsData %>% subset(index > 12500)
rows <- sample(nrow(newsData))
newsData <- data.frame(newsData[rows, ])
newsData$index <- c(1:nrow(newsData))
newsTraining <- newsData %>% subset(index <= 10000)
newsValidate <- newsData %>% subset(index > 10000 & index <= 12500)
newsTesting <- newsData %>% subset(index > 12500)
rows <- sample(nrow(twitterData))
twitterData <- data.frame(twitterData[rows, ])
twitterData$index <- c(1:nrow(twitterData))
twitterTraining <- twitterData %>% subset(index <= 10000)
twitterValidate <- twitterData %>% subset(index > 10000 & index <= 12500)
twitterTesting <- twitterData %>% subset(index > 12500)
blogsTraining <- data.frame(blogsTraining[, 1])
newsTraining <- data.frame(newsTraining[, 1])
twitterTraining <- data.frame(twitterTraining[, 1])
blogsTesting <- data.frame(blogsTesting[, 1])
newsTesting <- data.frame(newsTesting[, 1])
twitterTesting <- data.frame(twitterTesting[, 1])
# These last lines drop the index column used only for dividing the data, leaving a single text column in the training and test sets.
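The same shuffle-and-split is repeated three times above. A small helper function could remove that repetition; the sketch below (splitPlatform is a hypothetical name) assumes the text sits in the first column of each data frame, and because it reshuffles the rows it would not reproduce the exact splits used in this report.
# Shuffle a one-column text data frame and return the training/validation/test splits as a list.
splitPlatform <- function(df, nTrain=10000, nValidate=2500){
  shuffled <- data.frame(Text=as.character(df[sample(nrow(df)), 1]), stringsAsFactors=FALSE)
  list(training=shuffled[1:nTrain, , drop=FALSE],
       validate=shuffled[(nTrain + 1):(nTrain + nValidate), , drop=FALSE],
       testing=shuffled[(nTrain + nValidate + 1):nrow(shuffled), , drop=FALSE])
}
# Example: blogsSets <- splitPlatform(blogsData); blogsTraining <- blogsSets$training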
We now have a training set, a validation set, and a test set for each of our three data sources, for a total of nine sets. We will focus on the training set for the remainder of this report.
A quick look at the data will show that it consists of one column of text entries. As we get to know the data, we should find out how long each of these entries is. I will derive the character length of each entry and then run some basic statistics.
names(blogsTraining) <- c("Text")
names(newsTraining) <- c("Text")
names(twitterTraining) <- c("Text")
blogsTraining$length <- nchar(blogsTraining$Text)
newsTraining$length <- nchar(newsTraining$Text)
twitterTraining$length <- nchar(twitterTraining$Text)
max(blogsTraining$length)
## [1] 291892
max(newsTraining$length)
## [1] 97465
max(twitterTraining$length)
## [1] 105736
mean(blogsTraining$length)
## [1] 460.4719
mean(newsTraining$length)
## [1] 386.0463
mean(twitterTraining$length)
## [1] 192.6931
We can see here that the longest entry is found in the blogs data set and is nearly 300,000 characters long. However, most of the entries are less than 400 characters long. Below I’ve included a boxplot to show the distributions of character lengths.
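Because a few extremely long entries pull the means upward, medians and upper quantiles would support this claim more directly. I did not compute them in this run, but they could be checked with, for example:
# Medians and upper quantiles are more robust than means for these heavily skewed lengths.
median(blogsTraining$length)
quantile(blogsTraining$length, probs=c(0.5, 0.9, 0.99))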
blogsTraining$platform <- rep("Blogs", nrow(blogsTraining))
newsTraining$platform <- rep("News", nrow(newsTraining))
twitterTraining$platform <- rep("Twitter", nrow(twitterTraining))
allTraining <- rbind(blogsTraining, newsTraining, twitterTraining)
ylimit <- boxplot.stats(allTraining$length)$stats[c(1, 5)]
ggplot(data=allTraining, mapping=aes(x=platform, y=length)) + geom_boxplot() + xlab("Platform") + ylab("Character length of text line") + labs(title="Character length of text lines in three platforms", subtitle="Outliers above roughly 1,500 characters removed from view") + coord_cartesian(ylim = ylimit*3)
In this project we’re interested in individual words and groups of words. I’ll write a function that takes one of our loaded text files and returns a tokenized version of it, listing all individual words in that file.
createWordList <- function(dataframe1){
  for(i in 1:nrow(dataframe1)){
    if(i == 1){
      bigWords <- data.frame(tokenize_words(dataframe1[i, 1]))
      names(bigWords) <- c("Words")
    } else {
      newWords <- data.frame(tokenize_words(dataframe1[i, 1]))
      names(newWords) <- c("Words")
      bigWords <- rbind(bigWords, newWords)
    }
  }
  return(bigWords)
}
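Because rbind() inside a loop copies the growing data frame on every pass, this function slows down noticeably on larger samples. A vectorized sketch that should yield the same one-column result (createWordListFast is an illustrative name, not the function used for the results below):
# Tokenize every line at once and flatten the per-line token lists into one column.
createWordListFast <- function(dataframe1){
  data.frame(Words=unlist(tokenize_words(as.character(dataframe1[, 1]))), stringsAsFactors=FALSE)
}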
Now I’ll write a similar function that takes one of our loaded text files and returns a tokenized version of it, listing every N-gram from a minimum to a maximum size the user chooses.
createNGramList <- function(dataframe1, minGram, maxGram){
  for(i in 1:nrow(dataframe1)){
    if(i == 1){
      bigGrams <- data.frame(tokenize_ngrams(dataframe1[i, 1], n=maxGram, n_min=minGram))
      names(bigGrams) <- c("Grams")
    } else {
      newGrams <- data.frame(tokenize_ngrams(dataframe1[i, 1], n=maxGram, n_min=minGram))
      names(newGrams) <- c("Grams")
      bigGrams <- rbind(bigGrams, newGrams)
    }
  }
  return(bigGrams)
}
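The same vectorized approach applies to N-grams; again, this is only a sketch (createNGramListFast is an illustrative name), and the loop version above is what produced the results reported here.
# Tokenize all lines at once instead of growing a data frame with rbind().
createNGramListFast <- function(dataframe1, minGram, maxGram){
  grams <- tokenize_ngrams(as.character(dataframe1[, 1]), n=maxGram, n_min=minGram)
  data.frame(Grams=unlist(grams), stringsAsFactors=FALSE)
}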
Another requirement for cleaning this data is to remove profanity. I’ll now write a function that makes the necessary replacements, substituting the placeholder “[EXPLETIVE]” for six profane words. I’ve included spaces before and after each word so that I do not alter a non-profane word that merely contains one of these sequences of letters (for example, “class” contains “ass”).
watchLanguage <- function(dataframe1){
  # First we replace profanity in the middle of a line; gsub() catches every occurrence, not just the first.
  dataframe1[, 1] <- gsub("( ass )|( shit )|( fuck )|( damn )|( bitch )|( bastard )", " [EXPLETIVE] ", dataframe1[, 1], ignore.case=TRUE)
  # Next we replace these words at the start of a line.
  dataframe1[, 1] <- gsub("(^ass )|(^shit )|(^fuck )|(^damn )|(^bitch )|(^bastard )", "[EXPLETIVE] ", dataframe1[, 1], ignore.case=TRUE)
  # Next we replace these words at the end of a line.
  dataframe1[, 1] <- gsub("( ass$)|( shit$)|( fuck$)|( damn$)|( bitch$)|( bastard$)", " [EXPLETIVE]", dataframe1[, 1], ignore.case=TRUE)
  # Finally we replace these words when they make up the entire line.
  dataframe1[, 1] <- gsub("(^ass$)|(^shit$)|(^fuck$)|(^damn$)|(^bitch$)|(^bastard$)", "[EXPLETIVE]", dataframe1[, 1], ignore.case=TRUE)
  return(dataframe1)
}
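If the list of filtered words grows, the pattern could be assembled from a vector instead of being written out by hand. The sketch below uses word boundaries rather than literal spaces, which also catches these words next to punctuation; badWords, badPattern, and censorLanguage are illustrative names, and this version is not the one applied below.
# Build one regular expression from a vector of words; \\b prevents matches inside longer words such as "class".
badWords <- c("ass", "shit", "fuck", "damn", "bitch", "bastard")
badPattern <- paste0("\\b(", paste(badWords, collapse="|"), ")\\b")
censorLanguage <- function(dataframe1){
  dataframe1[, 1] <- gsub(badPattern, "[EXPLETIVE]", dataframe1[, 1], ignore.case=TRUE, perl=TRUE)
  return(dataframe1)
}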
I’ll apply our profanity filter function to all our training sets.
blogsTraining <- watchLanguage(blogsTraining)
newsTraining <- watchLanguage(newsTraining)
twitterTraining <- watchLanguage(twitterTraining)
To start exploring the frequency of words, I’ll tokenize each file word-wise using the function I previously wrote.
blogsWords <- createWordList(blogsTraining)
newsWords <- createWordList(newsTraining)
twitterWords <- createWordList(twitterTraining)
Now I’ll create N-grams for each file using the function above, setting both the minimum and maximum to 2 so that only 2-grams (two-word sequences) are produced.
blogsGrams <- createNGramList(blogsTraining, minGram=2, maxGram=2)
newsGrams <- createNGramList(newsTraining, minGram=2, maxGram=2)
twitterGrams <- createNGramList(twitterTraining, minGram=2, maxGram=2)
Let’s now examine the number of words in each file.
nrow(blogsWords)
## [1] 841432
nrow(newsWords)
## [1] 663994
nrow(twitterWords)
## [1] 355882
And let’s find the number of unique words in each file.
nrow(unique(blogsWords))
## [1] 47520
nrow(unique(newsWords))
## [1] 44074
nrow(unique(twitterWords))
## [1] 28803
The next step is to make frequency tables for the words.
blogWordFreq <- data.frame(table(blogsWords$Words))
newsWordFreq <- data.frame(table(newsWords$Words))
twitterWordFreq <- data.frame(table(twitterWords$Words))
Next I’ll make frequency tables for the N-Grams.
blogGramFreq <- data.frame(table(blogsGrams$Grams))
newsGramFreq <- data.frame(table(newsGrams$Grams))
twitterGramFreq <- data.frame(table(twitterGrams$Grams))
Now let’s sort all these lists to find the most common words and N-grams.
blogWordFreq <- blogWordFreq %>% arrange(desc(Freq))
newsWordFreq <- newsWordFreq %>% arrange(desc(Freq))
twitterWordFreq <- twitterWordFreq %>% arrange(desc(Freq))
blogGramFreq <- blogGramFreq %>% arrange(desc(Freq))
newsGramFreq <- newsGramFreq %>% arrange(desc(Freq))
twitterGramFreq <- twitterGramFreq %>% arrange(desc(Freq))
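For reference, the table-building and sorting steps above can be collapsed into a single call with dplyr’s count() (a sketch, assuming a reasonably recent dplyr; note that the word column keeps its name rather than becoming Var1):
# Equivalent to data.frame(table(...)) followed by arrange(desc(Freq)).
blogWordFreqAlt <- blogsWords %>% count(Words, name="Freq", sort=TRUE)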
head(blogWordFreq)
## Var1 Freq
## 1 the 40973
## 2 and 24065
## 3 to 23392
## 4 a 19672
## 5 of 19219
## 6 i 16812
head(newsWordFreq)
## Var1 Freq
## 1 the 37323
## 2 to 17279
## 3 and 17019
## 4 a 16578
## 5 of 14656
## 6 in 12777
head(twitterWordFreq)
## Var1 Freq
## 1 the 11044
## 2 to 9182
## 3 i 8474
## 4 a 7207
## 5 you 6502
## 6 and 5146
head(blogGramFreq)
## Var1 Freq
## 1 of the 4153
## 2 in the 3439
## 3 to the 1920
## 4 on the 1716
## 5 to be 1520
## 6 and the 1283
head(newsGramFreq)
## Var1 Freq
## 1 of the 3381
## 2 in the 3347
## 3 to the 1574
## 4 on the 1339
## 5 for the 1289
## 6 at the 1173
head(twitterGramFreq)
## Var1 Freq
## 1 in the 933
## 2 for the 855
## 3 of the 685
## 4 to be 550
## 5 on the 547
## 6 to the 535
To chart these results, I’ll first merge the six frequency tables into two, one for words and the other for 2-grams, keeping the ten most frequent entries from each platform.
blogWordFreq$Platform <- rep("Blogs", nrow(blogWordFreq))
newsWordFreq$Platform <- rep("News", nrow(newsWordFreq))
twitterWordFreq$Platform <- rep("Twitter", nrow(twitterWordFreq))
blogGramFreq$Platform <- rep("Blogs", nrow(blogGramFreq))
newsGramFreq$Platform <- rep("News", nrow(newsGramFreq))
twitterGramFreq$Platform <- rep("Twitter", nrow(twitterGramFreq))
topBlogWordFreq <- blogWordFreq[1:10, ]
topNewsWordFreq <- newsWordFreq[1:10, ]
topTwitterWordFreq <- twitterWordFreq[1:10, ]
topBlogGramFreq <- blogGramFreq[1:10, ]
topNewsGramFreq <- newsGramFreq[1:10, ]
topTwitterGramFreq <- twitterGramFreq[1:10, ]
topWordFreq <- rbind(topBlogWordFreq, topNewsWordFreq, topTwitterWordFreq)
topGramFreq <- rbind(topBlogGramFreq, topNewsGramFreq, topTwitterGramFreq)
Now I’ll chart the top words and 2-grams.
ggplot(data=topWordFreq, mapping=aes(x=Var1, y=Freq)) + geom_col() + facet_grid(Platform ~ .) + xlab("Word") + ylab("Frequency") + labs(title="Frequency of most common words in blogs, news, and Twitter files", subtitle="A missing bar only means the word was not in the top 10 for that platform, not that it was never observed.")
ggplot(data=topGramFreq, mapping=aes(x=Var1, y=Freq)) + geom_col() + facet_grid(Platform ~ .) + xlab("Two-Gram") + ylab("Frequency") + labs(title="Frequency of most common two-grams in blogs, news, and Twitter files", subtitle="A missing bar only means the two-gram was not in the top 10 for that platform, not that it was never observed.")
So far I have found that the most common words are function words such as articles, prepositions, and pronouns, and that the most common 2-grams are short combinations of those words along with versatile phrases such as “I love” or “thanks for”.
I am hoping to build an app that accepts a term and uses some of what I’ve learned about n-grams to predict the most common subsequent words. I would also like to help the user see not just the best-fitting result but the second, third, or fourth best-fitting result as well. For the algorithm, I am interested in the use of classification trees but am concerned about how to keep computing time manageable.
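As a first sanity check of that idea, the sorted 2-gram tables above already support a crude next-word lookup. The sketch below is purely illustrative (predictNext is a hypothetical name, and this is not the planned algorithm): it finds the most frequent 2-grams that begin with a given word and returns the words that follow it.
# Look up a word's most frequent successors in a sorted 2-gram frequency table.
predictNext <- function(word, gramFreq, topN=4){
  hits <- gramFreq[grepl(paste0("^", tolower(word), " "), gramFreq$Var1), ]
  sub("^[^ ]+ ", "", as.character(head(hits$Var1, topN)))
}
# Example: predictNext("of", blogGramFreq) should rank "the" near the top.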