For this capstone project, we must develop a mechanism for predicting the next word a user will type, based on a collection of text files drawn from blogs, news sources, and Twitter.
In this milestone report, I will share my progress so far in analyzing the text files provided for this project.
First I will set the working directory and load the packages needed for this project.
setwd("C:/Users/19732/Downloads")
library(tokenizers)
library(dplyr)
library(ggplot2)
# Here I want to acknowledge Lincoln Mullen, the author of the 'tokenizers' package.
Using the full files provided to us would significantly increase computing time, and loading every row is likely not needed to reach the necessary degree of statistical accuracy. For this project I’m going to use 15,000 entries each from the blogs, news, and Twitter data sets.
From each set, I’ll use 10,000 entries for the training set, 2,500 for the validation set, and 2,500 for the test set. I do not use the validation set or the test set in this milestone report, but I anticipate using them in the future.
Now we load the data.
blogsData <- read.delim("en_US.blogs.txt", header=FALSE, nrows=15000)
newsData <- read.delim("en_US.news.txt", header=FALSE, nrows=15000)
twitterData <- read.delim("en_US.twitter.txt", header=FALSE, nrows=15000)
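One caveat, offered as an observation rather than a change to the code above: read.delim() treats quotation marks as text delimiters, so lines containing them can be silently merged or skipped. Reading the raw lines avoids this; a minimal sketch for the blogs file (blogsDataAlt is an illustrative name, and the same pattern would apply to the other two files):
# Read the first 15,000 raw lines so embedded quotes and tabs are kept as ordinary text.
blogsLinesRaw <- readLines("en_US.blogs.txt", n=15000, encoding="UTF-8", skipNul=TRUE)
blogsDataAlt <- data.frame(V1=blogsLinesRaw, stringsAsFactors=FALSE)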
Next, we assign lines of text to the training, validation, and test sets.
set.seed(630)
rows <- sample(nrow(blogsData))
blogsData <- data.frame(blogsData[rows, ])
blogsData$index <- c(1:nrow(blogsData))
blogsTraining <- blogsData %>% subset(index <= 10000)
blogsValidate <- blogsData %>% subset(index > 10000 & index <= 12500)
blogsTesting <- blogsData %>% subset(index > 12500)
rows <- sample(nrow(newsData))
newsData <- data.frame(newsData[rows, ])
newsData$index <- c(1:nrow(newsData))
newsTraining <- newsData %>% subset(index <= 10000)
newsValidate <- newsData %>% subset(index > 10000 & index <= 12500)
newsTesting <- newsData %>% subset(index > 12500)
rows <- sample(nrow(twitterData))
twitterData <- data.frame(twitterData[rows, ])
twitterData$index <- c(1:nrow(twitterData))
twitterTraining <- twitterData %>% subset(index <= 10000)
twitterValidate <- twitterData %>% subset(index > 10000 & index <= 12500)
twitterTesting <- twitterData %>% subset(index > 12500)
blogsTraining <- data.frame(blogsTraining[, 1])
newsTraining <- data.frame(newsTraining[, 1])
twitterTraining <- data.frame(twitterTraining[, 1])
blogsTesting <- data.frame(blogsTesting[, 1])
newsTesting <- data.frame(newsTesting[, 1])
twitterTesting <- data.frame(twitterTesting[, 1])
# These last lines drop the index column used only for dividing the data, leaving a single text column in the training and test sets.
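The same shuffle-and-split is repeated three times above. A small helper function could remove that repetition; the sketch below (splitPlatform is a hypothetical name) assumes the text sits in the first column of each data frame, and because it reshuffles the rows it would not reproduce the exact splits used in this report.
# Shuffle a one-column text data frame and return the training/validation/test splits as a list.
splitPlatform <- function(df, nTrain=10000, nValidate=2500){
  shuffled <- data.frame(Text=as.character(df[sample(nrow(df)), 1]), stringsAsFactors=FALSE)
  list(training=shuffled[1:nTrain, , drop=FALSE],
       validate=shuffled[(nTrain + 1):(nTrain + nValidate), , drop=FALSE],
       testing=shuffled[(nTrain + nValidate + 1):nrow(shuffled), , drop=FALSE])
}
# Example: blogsSets <- splitPlatform(blogsData); blogsTraining <- blogsSets$training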
We now have a training set, a validation set, and a test set for each of our three data sources, for a total of nine sets. We will focus on the training set for the remainder of this report.
A quick look at the data will show that it consists of one column of text entries. As we get to know the data, we should find out how long each of these entries is. I will derive the character length of each entry and then run some basic statistics.
names(blogsTraining) <- c("Text")
names(newsTraining) <- c("Text")
names(twitterTraining) <- c("Text")
blogsTraining$length <- nchar(blogsTraining$Text)
newsTraining$length <- nchar(newsTraining$Text)
twitterTraining$length <- nchar(twitterTraining$Text)
max(blogsTraining$length)
## [1] 291892
max(newsTraining$length)
## [1] 97465
max(twitterTraining$length)
## [1] 105736
mean(blogsTraining$length)
## [1] 460.4719
mean(newsTraining$length)
## [1] 386.0463
mean(twitterTraining$length)
## [1] 192.6931
We can see here that the longest entry is found in the blogs data set and is nearly 300,000 characters long. However, most of the entries are less than 400 characters long. Below I’ve included a boxplot to show the distributions of character lengths.
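Because a few extremely long entries pull the means upward, medians and upper quantiles would support this claim more directly. I did not compute them in this run, but they could be checked with, for example:
# Medians and upper quantiles are more robust than means for these heavily skewed lengths.
median(blogsTraining$length)
quantile(blogsTraining$length, probs=c(0.5, 0.9, 0.99))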
blogsTraining$platform <- rep("Blogs", nrow(blogsTraining))
newsTraining$platform <- rep("News", nrow(newsTraining))
twitterTraining$platform <- rep("Twitter", nrow(twitterTraining))
allTraining <- rbind(blogsTraining, newsTraining, twitterTraining)
ylimit <- boxplot.stats(allTraining$length)$stats[c(1, 5)]
ggplot(data=allTraining, mapping=aes(x=platform, y=length)) + geom_boxplot() + xlab("Platform") + ylab("Character length of text line") + labs(title="Character length of text lines in three platforms", subtitle="Outliers above roughly 1,500 characters removed from view") + coord_cartesian(ylim = ylimit*3)
In this project we’re interested in individual words and groups of words. I’ll write a function that takes one of our loaded text files and returns a tokenized version of it, listing all individual words in that file.
createWordList <- function(dataframe1){
  for(i in 1:nrow(dataframe1)){
    if(i == 1){
      bigWords <- data.frame(tokenize_words(dataframe1[i, 1]))
      names(bigWords) <- c("Words")
    } else {
      newWords <- data.frame(tokenize_words(dataframe1[i, 1]))
      names(newWords) <- c("Words")
      bigWords <- rbind(bigWords, newWords)
    }
  }
  return(bigWords)
}
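Because rbind() inside a loop copies the growing data frame on every pass, this function slows down noticeably on larger samples. A vectorized sketch that should yield the same one-column result (createWordListFast is an illustrative name, not the function used for the results below):
# Tokenize every line at once and flatten the per-line token lists into one column.
createWordListFast <- function(dataframe1){
  data.frame(Words=unlist(tokenize_words(as.character(dataframe1[, 1]))), stringsAsFactors=FALSE)
}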
Now I’ll write a similar function that takes one of our loaded text files and returns a tokenized version of it, listing every N-gram from a minimum to a maximum size the user chooses.
createNGramList <- function(dataframe1, minGram, maxGram){
  for(i in 1:nrow(dataframe1)){
    if(i == 1){
      bigGrams <- data.frame(tokenize_ngrams(dataframe1[i, 1], n=maxGram, n_min=minGram))
      names(bigGrams) <- c("Grams")
    } else {
      newGrams <- data.frame(tokenize_ngrams(dataframe1[i, 1], n=maxGram, n_min=minGram))
      names(newGrams) <- c("Grams")
      bigGrams <- rbind(bigGrams, newGrams)
    }
  }
  return(bigGrams)
}
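The same vectorized approach applies to N-grams; again, this is only a sketch (createNGramListFast is an illustrative name), and the loop version above is what produced the results reported here.
# Tokenize all lines at once instead of growing a data frame with rbind().
createNGramListFast <- function(dataframe1, minGram, maxGram){
  grams <- tokenize_ngrams(as.character(dataframe1[, 1]), n=maxGram, n_min=minGram)
  data.frame(Grams=unlist(grams), stringsAsFactors=FALSE)
}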
Another requirement for cleaning this data is to remove profanity. I’ll now write a function that makes the necessary replacements, substituting the placeholder “[EXPLETIVE]” for six profane words. I’ve included spaces before and after each word so that I do not alter a non-profane word that merely contains one of these sequences of letters (for example, “class” contains “ass”).
watchLanguage <- function(dataframe1){
  # First we replace profanity in the middle of a line; gsub() catches every occurrence, not just the first.
  dataframe1[, 1] <- gsub("( ass )|( shit )|( fuck )|( damn )|( bitch )|( bastard )", " [EXPLETIVE] ", dataframe1[, 1], ignore.case=TRUE)
  # Next we replace these words at the start of a line.
  dataframe1[, 1] <- gsub("(^ass )|(^shit )|(^fuck )|(^damn )|(^bitch )|(^bastard )", "[EXPLETIVE] ", dataframe1[, 1], ignore.case=TRUE)
  # Next we replace these words at the end of a line.
  dataframe1[, 1] <- gsub("( ass$)|( shit$)|( fuck$)|( damn$)|( bitch$)|( bastard$)", " [EXPLETIVE]", dataframe1[, 1], ignore.case=TRUE)
  # Finally we replace these words when they make up the entire line.
  dataframe1[, 1] <- gsub("(^ass$)|(^shit$)|(^fuck$)|(^damn$)|(^bitch$)|(^bastard$)", "[EXPLETIVE]", dataframe1[, 1], ignore.case=TRUE)
  return(dataframe1)
}
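If the list of filtered words grows, the pattern could be assembled from a vector instead of being written out by hand. The sketch below uses word boundaries rather than literal spaces, which also catches these words next to punctuation; badWords, badPattern, and censorLanguage are illustrative names, and this version is not the one applied below.
# Build one regular expression from a vector of words; \\b prevents matches inside longer words such as "class".
badWords <- c("ass", "shit", "fuck", "damn", "bitch", "bastard")
badPattern <- paste0("\\b(", paste(badWords, collapse="|"), ")\\b")
censorLanguage <- function(dataframe1){
  dataframe1[, 1] <- gsub(badPattern, "[EXPLETIVE]", dataframe1[, 1], ignore.case=TRUE, perl=TRUE)
  return(dataframe1)
}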
I’ll apply our profanity filter function to all our training sets.
blogsTraining <- watchLanguage(blogsTraining)
newsTraining <- watchLanguage(newsTraining)
twitterTraining <- watchLanguage(twitterTraining)
To start exploring the frequency of words, I’ll tokenize each file word-wise using the function I previously wrote.
blogsWords <- createWordList(blogsTraining)
newsWords <- createWordList(newsTraining)
twitterWords <- createWordList(twitterTraining)
Now I’ll create N-grams for each file using the function above, setting both the minimum and maximum to 2 so that only 2-grams (two-word sequences) are produced.
blogsGrams <- createNGramList(blogsTraining, minGram=2, maxGram=2)
newsGrams <- createNGramList(newsTraining, minGram=2, maxGram=2)
twitterGrams <- createNGramList(twitterTraining, minGram=2, maxGram=2)
Let’s now examine the number of words in each file.
nrow(blogsWords)
## [1] 841432
nrow(newsWords)
## [1] 663994
nrow(twitterWords)
## [1] 355882
And let’s find the number of unique words in each file.
nrow(unique(blogsWords))
## [1] 47520
nrow(unique(newsWords))
## [1] 44074
nrow(unique(twitterWords))
## [1] 28803
The next step is to make frequency tables for the words.
blogWordFreq <- data.frame(table(blogsWords$Words))
newsWordFreq <- data.frame(table(newsWords$Words))
twitterWordFreq <- data.frame(table(twitterWords$Words))
Next I’ll make frequency tables for the N-Grams.
blogGramFreq <- data.frame(table(blogsGrams$Grams))
newsGramFreq <- data.frame(table(newsGrams$Grams))
twitterGramFreq <- data.frame(table(twitterGrams$Grams))
Now let’s sort all these lists to find the most common words and N-grams.
blogWordFreq <- blogWordFreq %>% arrange(desc(Freq))
newsWordFreq <- newsWordFreq %>% arrange(desc(Freq))
twitterWordFreq <- twitterWordFreq %>% arrange(desc(Freq))
blogGramFreq <- blogGramFreq %>% arrange(desc(Freq))
newsGramFreq <- newsGramFreq %>% arrange(desc(Freq))
twitterGramFreq <- twitterGramFreq %>% arrange(desc(Freq))
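For reference, the table-building and sorting steps above can be collapsed into a single call with dplyr’s count() (a sketch, assuming a reasonably recent dplyr; note that the word column keeps its name rather than becoming Var1):
# Equivalent to data.frame(table(...)) followed by arrange(desc(Freq)).
blogWordFreqAlt <- blogsWords %>% count(Words, name="Freq", sort=TRUE)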
head(blogWordFreq)
## Var1 Freq
## 1 the 40973
## 2 and 24065
## 3 to 23392
## 4 a 19672
## 5 of 19219
## 6 i 16812
head(newsWordFreq)
## Var1 Freq
## 1 the 37323
## 2 to 17279
## 3 and 17019
## 4 a 16578
## 5 of 14656
## 6 in 12777
head(twitterWordFreq)
## Var1 Freq
## 1 the 11044
## 2 to 9182
## 3 i 8474
## 4 a 7207
## 5 you 6502
## 6 and 5146
head(blogGramFreq)
## Var1 Freq
## 1 of the 4153
## 2 in the 3439
## 3 to the 1920
## 4 on the 1716
## 5 to be 1520
## 6 and the 1283
head(newsGramFreq)
## Var1 Freq
## 1 of the 3381
## 2 in the 3347
## 3 to the 1574
## 4 on the 1339
## 5 for the 1289
## 6 at the 1173
head(twitterGramFreq)
## Var1 Freq
## 1 in the 933
## 2 for the 855
## 3 of the 685
## 4 to be 550
## 5 on the 547
## 6 to the 535
To chart these results, I’ll first merge the six frequency tables into two, one for words and the other for 2-grams, keeping the ten most frequent entries from each platform.
blogWordFreq$Platform <- rep("Blogs", nrow(blogWordFreq))
newsWordFreq$Platform <- rep("News", nrow(newsWordFreq))
twitterWordFreq$Platform <- rep("Twitter", nrow(twitterWordFreq))
blogGramFreq$Platform <- rep("Blogs", nrow(blogGramFreq))
newsGramFreq$Platform <- rep("News", nrow(newsGramFreq))
twitterGramFreq$Platform <- rep("Twitter", nrow(twitterGramFreq))
topBlogWordFreq <- blogWordFreq[1:10, ]
topNewsWordFreq <- newsWordFreq[1:10, ]
topTwitterWordFreq <- twitterWordFreq[1:10, ]
topBlogGramFreq <- blogGramFreq[1:10, ]
topNewsGramFreq <- newsGramFreq[1:10, ]
topTwitterGramFreq <- twitterGramFreq[1:10, ]
topWordFreq <- rbind(topBlogWordFreq, topNewsWordFreq, topTwitterWordFreq)
topGramFreq <- rbind(topBlogGramFreq, topNewsGramFreq, topTwitterGramFreq)
Now I’ll chart the top words and 2-grams.
ggplot(data=topWordFreq, mapping=aes(x=Var1, y=Freq)) + geom_col() + facet_grid(Platform ~ .) + xlab("Word") + ylab("Frequency") + labs(title="Frequency of most common words in blogs, news, and Twitter files", subtitle="A missing bar only means the word was not in the top 10 for that platform, not that it was never observed.")
ggplot(data=topGramFreq, mapping=aes(x=Var1, y=Freq)) + geom_col() + facet_grid(Platform ~ .) + xlab("Two-Gram") + ylab("Frequency") + labs(title="Frequency of most common two-grams in blogs, news, and Twitter files", subtitle="A missing bar only means the two-gram was not in the top 10 for that platform, not that it was never observed.")
So far I have found that the most common words are function words such as articles, prepositions, and pronouns, and that the most common 2-grams are short combinations of those words along with versatile phrases such as “I love” or “thanks for”.
I am hoping to build an app that accepts a term and uses some of what I’ve learned about n-grams to predict the most common subsequent words. I would also like to help the user see not just the best-fitting result but the second, third, or fourth best-fitting result as well. For the algorithm, I am interested in the use of classification trees but am concerned about how to keep computing time manageable.
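As a first sanity check of that idea, the sorted 2-gram tables above already support a crude next-word lookup. The sketch below is purely illustrative (predictNext is a hypothetical name, and this is not the planned algorithm): it finds the most frequent 2-grams that begin with a given word and returns the words that follow it.
# Look up a word's most frequent successors in a sorted 2-gram frequency table.
predictNext <- function(word, gramFreq, topN=4){
  hits <- gramFreq[grepl(paste0("^", tolower(word), " "), gramFreq$Var1), ]
  sub("^[^ ]+ ", "", as.character(head(hits$Var1, topN)))
}
# Example: predictNext("of", blogGramFreq) should rank "the" near the top.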