This is the milestone report for the Data Science Capstone Project. It explains the major features of the data and briefly summarizes my plans for creating the prediction algorithm. A large portion of the report deals with cleaning and preparing the data.
First, let’s load the data and take a random 10% sample from all three data sets. This will decrease overall computation time and allow for quicker experimentation with the data.
## Setting the seed for reproducibility.
set.seed(100)
## This function takes a random sample of the data.
sampler <- function(chunk, perc)
{
sample.size <- round(length(chunk)*perc)
sample.index <- sample(1:length(chunk), sample.size)
return(chunk[sample.index])
}
## Let's start off with a 10% sample
US_blogs <- readLines("./Data/final/en_US/en_US.blogs.txt")
US_blogs <- sampler(US_blogs, .10)
US_news <- readLines("./Data/final/en_US/en_US.news.txt")
US_news <- sampler(US_news, .10)
US_twitter <- readLines("./Data/final/en_US/en_US.twitter.txt")
US_twitter <- sampler(US_twitter, .10)
Let’s take a look at some basic descriptive measures of the size of the text.
library(knitr)
basic.measures <- cbind(
c("Text Chunks", "Characters"),
rbind(c(length(US_news),length(US_blogs),
length(US_twitter)),
c(sum(nchar(US_news)), sum(nchar(US_blogs)),
sum(nchar(US_twitter))))
)
colnames(basic.measures) <- c("Measure", "News", "Blogs", "Twitter")
kable(basic.measures)
| Measure | News | Blogs | Twitter |
|---|---|---|---|
| Text Chunks | 7726 | 89929 | 236015 |
| Characters | 1555444 | 20761203 | 16215333 |
This sampled data set is still quite large and likely contains material we can remove. For starters, profanity filtering is a good idea. I will first reformat and combine two profanity lists, then remove any text chunks that contain a word from the combined list. The lists used can be found at the URLs included as comments in the code below.
# https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
bad.words.1 <- readLines("./Bad_Word_List.1.txt")
bad.words.1 <- bad.words.1[-length(bad.words.1)]
# https://gist.githubusercontent.com/jamiew/1112488/raw/7ca9b1669e1c24b27c66174762cb04e14cf05aa7/google_twunter_lol
bad.words.2 <- readLines("./Bad_Word_List.2.txt")
bad.words.2 <- bad.words.2[-1]
bad.words.2 <- substr(x = bad.words.2, start = 1, stop = nchar(bad.words.2)-3)
double.quote.index <- grep(pattern = "\"", x = bad.words.2)
bad.words.2[double.quote.index] <- substr(x = bad.words.2[double.quote.index], start = 2,
stop = nchar(bad.words.2[double.quote.index])-1)
all.bad.words <- c(bad.words.1, bad.words.2)
all.bad.words <- unique(all.bad.words)
all.bad.words <- paste(all.bad.words, collapse="|")
all.bad.words <- substr(x = all.bad.words, start = 1, stop = nchar(all.bad.words)-1)
bad.words.twitter <- grep(all.bad.words, US_twitter)
bad.words.news <- grep(all.bad.words, US_news)
bad.words.blogs <- grep(all.bad.words, US_blogs)
bad.prop.twitter <- length(bad.words.twitter)/length(US_twitter)
bad.prop.news <- length(bad.words.news)/length(US_news)
bad.prop.blogs <- length(bad.words.blogs)/length(US_blogs)
US_twitter <- US_twitter[-bad.words.twitter]
US_news <- US_news[-bad.words.news]
US_blogs <- US_blogs[-bad.words.blogs]
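To make the filtering step concrete, here is a small illustration of the collapsed alternation pattern approach; the words below are harmless placeholders, not entries from the actual lists.
## Illustrative only: a tiny pattern built the same way as all.bad.words
toy.pattern <- paste(c("foo", "bar"), collapse = "|")
grep(toy.pattern, c("a clean chunk", "a chunk with foo", "bar appears here"))
## [1] 2 3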
Now let’s take a look at the size of the new data set and what proportion of the chunks contained profanity.
basic.measures.2 <- cbind(
c("Text Chunks", "Characters", "Profanity Percent"),
rbind(c(length(US_news),length(US_blogs),
length(US_twitter)),
c(sum(nchar(US_news)), sum(nchar(US_blogs)),
sum(nchar(US_twitter))),
round(c(bad.prop.news, bad.prop.blogs,
bad.prop.twitter), 4))
)
colnames(basic.measures.2) <- c("Measure", "News", "Blogs", "Twitter")
kable(basic.measures.2)
| Measure | News | Blogs | Twitter |
|---|---|---|---|
| Text Chunks | 5924 | 68960 | 208519 |
| Characters | 1081361 | 12065112 | 13911641 |
| Profanity Proportion | 0.2332 | 0.2332 | 0.1165 |
With the reduced, clean data set, we can now focus on dealing with numbers and individual sentences. I decided to replace all of the numbers in the data with generic “xnumber” and “xdollars” markers. By trading the unique numbers for standard markers, I create some homogeneity in the data, which will make it easier to count and analyze the terms immediately before and after the markers.
The text chunks each contain multiple sentences and distinct word combinations. Simply taking out punctuation and merging the chunks will distort the true order of the words. Ideas are typically separated with periods, commas, semicolons, and colons. I decided to split the data on these to preserve the associations between groups of words and prevent unrelated words from merging.
The following code chunk creates functions that deal with generic numbers and sentence splitting. These are then combined and applied to the data.
## Replace numbers and currency with generic "xnumber" and "xdollars" markers
generic.numbers <- function(chunk)
{
chunk <- gsub(pattern = "[$](([0-9]+)|([,]*))+ +", replacement = "xdollars ",
x = chunk)
chunk <- gsub(pattern = "\\d+", replacement = "xnumber",
x = chunk)
chunk <- gsub(pattern = "xnumber,xnumber", replacement = "xnumber",
x = chunk)
return(chunk)
}
## Separate text into distinct sentences based on (. , ; :)
sentence.splitter <- function(chunk)
{
if(grepl("(*)[.]|[,]|[;]|[:] [A-Z](*)", chunk))
return(strsplit(chunk, "(*)[.]|[,]|[;]|[:] (*)"))
else(return(chunk))
}
## Combining above functions to get cleaner lists of words
cleaner <- function(chunk)
{
clean.chunks <- vector(mode = "character")
for(i in 1:length(chunk))
{
clean.chunks <- c(clean.chunks,
sentence.splitter(generic.numbers(chunk[i]))[[1]])
}
#Remove leading and trailing whitespace
clean.chunks <- gsub("^\\s+|\\s+$", "", clean.chunks)
return(clean.chunks)
}
US_news_list <- cleaner(US_news); rm(US_news)
US_blogs_list <- cleaner(US_blogs); rm(US_blogs)
US_twitter_list <- cleaner(US_twitter); rm(US_twitter)
#Remove empty entries from the list
US_news_list <- US_news_list[US_news_list != ""]
US_blogs_list <- US_blogs_list[US_blogs_list != ""]
US_twitter_list <- US_twitter_list[US_twitter_list != ""]
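As a quick illustration of the number handling, here is what the generic.numbers() function produces for a made-up sentence that is not drawn from the corpus.
## Illustration on a made-up sentence (not from the corpus)
generic.numbers("I paid $20 for 3 tickets")
## [1] "I paid xdollars for xnumber tickets"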
With this reduced and consistent data structure, we can look at how often groups of words show up. I will focus on the frequencies of single terms, pairs of terms, and triples of terms. These are known as n-grams, and they are a common tool in computational linguistics. The following two functions create the n-grams and their frequency counts. They are then applied to the Twitter data to produce data for plotting.
## A function that creates ngrams
n.gram <- function(sentence, n)
{
sent <- strsplit(sentence, split = " ")
if(length(sent[[1]]) < n)
return()
ns <- vector(mode = "character", length = length(sent[[1]])-n+1)
for(i in 1:(length(sent[[1]])-n+1))
{
ns[i] <- paste((sent[[1]][i:(i+n-1)]), collapse = " ")
}
return(ns)
}
## N.Gram Example
n.gram(US_twitter_list[100], 2)
## [1] "I know" "know I" "I love" "love it"
## Returning a table of frequency counts for ngrams
n.gram.table <- function(word_list, n)
{
n.gram.medium <- sapply(word_list, n.gram, n = n)
n.gram.medium <- table(unlist(n.gram.medium))
props <- n.gram.medium/sum(n.gram.medium)
n.gram.medium <- data.frame(n.gram.medium, props)
colnames(n.gram.medium) <- c("N.Gram", "Freq"," ", "Prop")
n.gram.medium <- n.gram.medium[order(-n.gram.medium$Prop), ]
# Setup for plotting
n.gram.medium$N.Gram <- factor(n.gram.medium$N.Gram,
levels = n.gram.medium$N.Gram[order(n.gram.medium$Freq)])
return(n.gram.medium[,-3])
}
twitter.1.gram <- n.gram.table(US_twitter_list, n = 1)
twitter.2.gram <- n.gram.table(US_twitter_list, n = 2)
twitter.3.gram <- n.gram.table(US_twitter_list, n = 3)
Finally, we can plot the relative proportions of the n-grams. I chose to plot proportions because they highlight the popularity of each n-gram relative to the others and to the whole data set. The ten most frequent n-grams are displayed. The Twitter data are used here, but the same approach can easily be applied to the other data sets.
library(ggplot2)
library(RColorBrewer)
n.gram.plot <- function(n.gram.df, topn, name)
{
custom.pal <- colorRampPalette(brewer.pal(6,"Blues"))(15)
ggplot(n.gram.df[1:topn,], aes(x = N.Gram, y = Prop, fill = Prop)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = custom.pal[5], high = custom.pal[10]) +
ggtitle(name) +
coord_flip() +
theme(legend.position = "none")
}
n.gram.plot(twitter.1.gram, 10, "Twitter 1-Gram Relative Proportions")
n.gram.plot(twitter.2.gram, 10, "Twitter 2-Gram Relative Proportions")
n.gram.plot(twitter.3.gram, 10, "Twitter 3-Gram Relative Proportions")
We can clearly see that as the “n” in n-gram increases, the absolute proportions fall precipitously, and the distribution flattens out quickly. The top two 3-gram entries actually say the same thing, but the capital “T” leads to two separate counts. This is something to keep in mind going forward.
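As a rough illustration of how case folding could merge such duplicate counts, consider the snippet below; the strings are made up for the example rather than taken from the tables above.
## Illustrative only: lowercasing merges n-grams that differ only in case
example.grams <- c("Thanks for the", "thanks for the", "thanks for the")
table(example.grams)          ## two separate counts of the same phrase
table(tolower(example.grams)) ## a single count of 3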
Clean and reliable data are key to building a good model, and scrubbing the data meticulously should be the first step in building the algorithm. The code above is a good start to cleaning the data, but it still needs some work. Issues to consider going forward include consistent handling of capitalization, as seen in the 3-gram counts above, and further refinement of the number, punctuation, and profanity handling described earlier.
I am exploring several options for the model’s structure. A simple starting point is to predict using the last word of observed n-grams: several candidate predictions could be ranked by their observed past frequencies, with the most frequently observed last word offered as the first prediction (a rough sketch of this idea follows below). This model is fairly simple when sticking to n-grams of a single size, such as two; larger n-grams still need to be considered, along with their interactions with smaller n-grams. Overall differences between the text sources also need to be analyzed before the data are lumped together into the model, including differences in style, sentence structure, term frequencies, spelling, and n-gram distributions. That analysis is beyond the scope of this milestone report, but it will be considered during subsequent steps of the project.
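To make the idea concrete, below is a minimal sketch of a frequency-ranked next-word lookup built on a 2-gram table like twitter.2.gram. The function name predict.next.word and the exact matching rules are placeholders for illustration, not the final model; backoff between n-gram sizes and handling of unseen words are left for later.
## A minimal sketch, assuming a 2-gram table with columns N.Gram, Freq, Prop
predict.next.word <- function(word, bigram.table, topn = 3)
{
  grams <- strsplit(as.character(bigram.table$N.Gram), split = " ")
  first.words <- sapply(grams, `[`, 1)
  last.words <- sapply(grams, `[`, 2)
  matches <- which(first.words == word)
  if(length(matches) == 0)
    return(character(0))
  ## Rank candidate completions by their observed frequency
  matches <- matches[order(-bigram.table$Freq[matches])]
  return(head(last.words[matches], topn))
}
## Example call; the actual output depends on the sampled data
## predict.next.word("thanks", twitter.2.gram)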