This milestone report shows my progress on the final project for the capstone course of the Johns Hopkins University Data Science Specialization on Coursera. I perform an exploratory data analysis to learn the structure of the data set we will use for the final project.
First we read in the data. I use readr's guess_encoding() to make sure each file is read with the proper encoding.
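The packages used throughout this report are loaded up front. This list is my best guess based on the functions called below; adjust it to match your own setup.
# Load the packages used in this report (assumed list)
library(readr)        # guess_encoding()
library(R.utils)      # countLines()
library(tm)           # VCorpus(), tm_map(), TermDocumentMatrix()
library(RWeka)        # NGramTokenizer(), Weka_control()
library(slam)         # rollup()
library(dplyr)        # arrange(), top_n()
library(ggplot2)      # bar plots
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal()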
# Get directories of the data files that we want to read
us_blogs_dir <- "./final/en_us/en_US.blogs.txt"
us_news_dir <- "./final/en_us/en_US.news.txt"
us_twitter_dir <- "./final/en_us/en_US.twitter.txt"
# Guess encoding for each file
us_blogs_encoding <- guess_encoding(us_blogs_dir, n_max=1000)$encoding[1]
us_news_encoding <- guess_encoding(us_news_dir, n_max=1000)$encoding[1]
us_twitter_encoding <- guess_encoding(us_twitter_dir, n_max=1000)$encoding[1]
# Read in files line by line
us_blogs <- readLines(us_blogs_dir, encoding=us_blogs_encoding, warn=FALSE)
us_news <- readLines(us_news_dir, encoding=us_news_encoding, warn=FALSE)
us_twitter <- readLines(us_twitter_dir, encoding=us_twitter_encoding, warn=FALSE)
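As an optional sanity check (not part of the original analysis), we can peek at the first line of each file to confirm the text was read in cleanly.
# Optional: preview the first line of each source
head(us_blogs, 1)
head(us_news, 1)
head(us_twitter, 1)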
As an initial exploratory measure, I’ll get the size and number of lines of each file.
# Calculate file sizes in MB
blogs_file_size <- file.info(us_blogs_dir)$size/(1024^2)
news_file_size <- file.info(us_news_dir)$size/(1024^2)
twitter_file_size <- file.info(us_twitter_dir)$size/(1024^2)
# Combine file sizes
file_sizes <- rbind(blogs_file_size, news_file_size, twitter_file_size)
# Count number of lines in each file
blogs_file_lines <- countLines(us_blogs_dir)
news_file_lines <- countLines(us_news_dir)
twitter_file_lines <- countLines(us_twitter_dir)
# Combine number of lines
num_lines <- rbind(blogs_file_lines, news_file_lines, twitter_file_lines)
# Combine file encodings
encodings <- rbind(us_blogs_encoding, us_news_encoding, us_twitter_encoding)
# Combine both stats
file_stats <- as.data.frame(cbind(file_sizes, num_lines, encodings))
colnames(file_stats) <- c("File Size (in MB)","Number of Lines", "File Encoding")
rownames(file_stats) <- c("Blogs", "News", "Twitter")
file_stats
##         File Size (in MB) Number of Lines File Encoding
## Blogs    200.424207687378          899288         UTF-8
## News     196.277512550354         1010242         UTF-8
## Twitter  159.364068984985         2360148         UTF-8
These data sets are very large, so we take samples from them to keep the data manageable.
# Set seed
set.seed(12345)
# Grab samples from raw data
blogs_sample <- sample(us_blogs, size=10000)
news_sample <- sample(us_news, size=10000)
twitter_sample <- sample(us_twitter, size=10000)
Now we can combine the samples into a single text corpus.
# Combine sample sets to create corpus for training
corpus_raw <- c(blogs_sample, news_sample, twitter_sample)
The raw data sets from the previous steps take up quite a bit of memory, so let’s remove them to free up some space.
# Remove raw-er data sets
rm(us_blogs, blogs_sample,
us_news, news_sample,
us_twitter, twitter_sample)
Now we can remove unwanted words and punctuation characters. Let’s first define a helper function to make this easier.
# Change special characters to a space character
change_to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
Begin cleaning!
# Remove non-ASCII characters
corpus <- iconv(corpus_raw, "UTF-8", "ASCII", sub="")
# Make corpus
corpus <- VCorpus(VectorSource(corpus))
## Begin cleaning
# Lowercase all characters
corpus <- tm_map(corpus, content_transformer(tolower))
# Strip whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove punctuation characters
corpus <- tm_map(corpus, removePunctuation)
# Remove other characters
corpus <- tm_map(corpus, change_to_space, "/|@|\\|")
# Remove stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
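Before tokenizing, it can help to spot-check a couple of cleaned documents to confirm the transformations behaved as expected; a minimal check using tm's inspect():
# Optional: spot-check the first two cleaned documents
inspect(corpus[1:2])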
Now we can tokenize the corpus into N-grams. For our purposes we will only go up to trigrams.
# Token delimiters passed to the Weka tokenizers
delims <- " \\r\\n\\t.,;:\"()?!"
# Tokenizer functions for unigrams, bigrams, and trigrams
tokenize_uni <- function(x){NGramTokenizer(x, Weka_control(min=1, max=1, delimiters=delims))}
tokenize_bi <- function(x){NGramTokenizer(x, Weka_control(min=2, max=2, delimiters=delims))}
tokenize_tri <- function(x){NGramTokenizer(x, Weka_control(min=3, max=3, delimiters=delims))}
# Build a term-document matrix for each N-gram size
unigram <- TermDocumentMatrix(corpus, control=list(tokenize=tokenize_uni))
bigram <- TermDocumentMatrix(corpus, control=list(tokenize=tokenize_bi ))
trigram <- TermDocumentMatrix(corpus, control=list(tokenize=tokenize_tri))
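As an optional check that tokenization worked, we can look at the matrix dimensions and list terms above a frequency threshold with tm's findFreqTerms():
# Optional: dimensions (terms x documents) and terms appearing at least 100 times
dim(unigram)
findFreqTerms(unigram, lowfreq = 100)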
Let’s count the most frequent tokens in each N-gram set.
# Transform N-grams structure to pull out token frequencies
unigram_r <- rollup(unigram, 2, na.rm = TRUE, FUN = sum)
bigram_r <- rollup( bigram, 2, na.rm = TRUE, FUN = sum)
trigram_r <- rollup(trigram, 2, na.rm = TRUE, FUN = sum)
# Get token frequencies of each N-gram
unigram_tokens_counts <- data.frame(Token = unigram$dimnames$Terms, Frequency = unigram_r$v)
bigram_tokens_counts <- data.frame(Token = bigram$dimnames$Terms, Frequency = bigram_r$v)
trigram_tokens_counts <- data.frame(Token = trigram$dimnames$Terms, Frequency = trigram_r$v)
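As an aside, the same per-term totals can be computed more directly with slam's row_sums(), skipping the rollup() step; a sketch for the unigram case, which should give the same frequencies as above:
# Alternative: sum each term's counts across documents in one call
unigram_tokens_counts <- data.frame(Token = Terms(unigram),
                                    Frequency = row_sums(unigram))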
# Sort tokens by frequency, filter top 100 most frequent
top_unigram <- unigram_tokens_counts %>%
arrange(desc(Frequency)) %>% top_n(100, Frequency)
top_bigram <- bigram_tokens_counts %>%
arrange(desc(Frequency)) %>% top_n(100, Frequency)
top_trigram <- trigram_tokens_counts %>%
arrange(desc(Frequency)) %>% top_n(100, Frequency)
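A quick look at the top of each table (optional) confirms the sort before plotting.
# Optional: preview the most frequent tokens in each table
head(top_unigram, 10)
head(top_bigram, 10)
head(top_trigram, 10)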
Finally, we can make some barplots showing the frequencies of the most common tokens in each N-gram.
# Plot the top_n most frequent tokens from a token-frequency data frame
make_ngram_barplot <- function(x, top_n, n, color){
main_title <- paste("Top", as.character(top_n), "most frequent", n)
ggplot(x[1:top_n,], aes(reorder(Token, -Frequency), Frequency)) +
geom_bar(stat="identity", fill=I(color)) +
labs(x=n, y="Frequency") + ggtitle(main_title) +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1))
}
make_ngram_barplot(top_unigram, 20, "Unigrams", "red")
make_ngram_barplot(top_bigram, 20, "Bigrams", "blue")
make_ngram_barplot(top_trigram, 20, "Trigrams", "yellow")
We can also make word clouds as another way to visualize our token frequencies.
# Draw a word cloud from a token-frequency data frame
make_word_cloud <- function(x, s, max_words){
wordcloud(x[,1], x[,2], scale=s,
min.freq=5, max.words=max_words, random.order=FALSE,
rot.per=0.5, colors=brewer.pal(8, "Dark2"),
use.r.layout=FALSE)
}
make_word_cloud(top_unigram, c(3.0, 0.1), 100)
make_word_cloud(top_bigram, c(2.3, 0.1), 100)
make_word_cloud(top_trigram, c(1.5, 0.1), 100)
It seems we will have to trade some of our model’s accuracy for runtime. Even though the sample is small compared to the raw data sets, constructing the N-gram data still takes quite a while. I think a combined sample size of 30,000 lines will be sufficient. The data cleaning could also be improved, as some of the trigrams look a bit odd.
For the final project, I plan to train a prediction model using the N-grams constructed here and deploy it as a Shiny application that predicts the next word from a user’s input.