This brief report is part of the capstone project for the Johns Hopkins Data Science Specialization sequence on Coursera. The goal here is to perform a basic data-loading procedure and report on some of the major features of a body of text as revealed by exploratory visualization. The ultimate goal (though not within the scope of this report) is to train an algorithm for predicting the next word as a user types. The idea is that, by looking at a large body of text from various sources, the machine should be able to learn some of the most common patterns in modern English and make predictions based on them.
The text data used here can be found by downloading the file located at [https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip]. We load the raw files into the working directory with the following R code:
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "./data.zip", method = "curl")
unzip("./data.zip", overwrite = TRUE, exdir = ".") # remaining arguments left at their defaults
The English-language data we are interested in is located in the path ./final/en_US and consists of three .txt files, en_US.blogs, en_US.news, and en_US.twitter, which together represent a cross section of web-based text in three familiar formats. We expect some differences across them: for example, tweets are limited to 140 characters, so there will doubtless be more creative space-saving techniques (“u” instead of “you”) within that set. The other two files feature fragments ranging from single sentences to entire articles.
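Before sampling anything, it helps to get a sense of how large each file is. The quick check below is purely for reference and uses only base R (it re-reads each file in full, so it takes a moment for the larger sets):
### Quick look at the size and line count of each raw file.
files <- c("./final/en_US/en_US.blogs.txt",
           "./final/en_US/en_US.news.txt",
           "./final/en_US/en_US.twitter.txt")
data.frame(file = basename(files),
           size_MB = round(file.size(files) / 1024^2, 1),
           lines = unname(sapply(files, function(f) length(readLines(f, warn = FALSE)))))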
Let’s explore the blog set first; it’s not unreasonable to consider it a stylistic ‘middle ground’ between the news articles and the tweets. Later, we will explore all three sets together.
First we write a function that samples lines from a .txt file for use in the later analysis. This is necessary because the files are very large, and we should not need every line to build a model. The function uses only base R, taking a file path and the proportion of total lines to keep.
### Create sampling function.
samp <- function(file, frac) {
    fl <- readLines(file)                                      # read every line of the file
    fl_samp <- sample(fl, frac * length(fl), replace = FALSE)  # keep a random fraction of the lines
    return(fl_samp)
}
### Sample 3% of blogging lines.
blog_samp<-samp("./final/en_US/en_US.blogs.txt",0.03)
### Read in the whole blogging set for comparison of major features.
blog<-readLines("./final/en_US/en_US.blogs.txt")
The blog data needs to be cleaned and placed into a form suitable for NLP analysis. We will use the quanteda package for this, starting with the creation of a ‘corpus’ object, followed by ‘tokenization’: the separation of the text into lowercase words so that we can see the patterns within them.
### Create a function for tokenizing the raw text.
tok <- function(text) {
    library("quanteda")   # corpus and tokens objects
    library("sentimentr") # profanity lists (via the lexicon package)
    corp <- corpus(text)  # corpus object for tokenization
    toks <- tokens(corp, what = "word", remove_numbers = TRUE, remove_punct = TRUE,
                   remove_separators = TRUE, remove_url = TRUE, remove_symbols = TRUE) %>% # split into word tokens
        tokens_tolower(keep_acronyms = FALSE) %>%                                          # make lowercase
        tokens_select(selection = "remove", pattern = lexicon::profanity_alvarez)          # filter profanity
    return(toks)
}
### Create token objects from blogging set.
set.seed(43022)
blog_samp_tok<-tok(blog_samp) # tokens for sample
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 4 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
blog_tok<-tok(blog) # tokens of whole set
Let’s compare the two sets of blogging data to see if our sample is representative. We have a function for visualizing the top 20 words with ggplot2.
### Write a function for viewing the top tokens.
freq <- function(toks, k) { # k is the number of top tokens to view
    library("quanteda.textstats") # feature extraction
    library("ggplot2")            # plotting
    library("gridExtra")          # plot arrangement
    dfmat <- dfm(toks) %>% dfm_weight(scheme = "prop") # document-feature matrix with relative frequencies
    feat <- textstat_frequency(dfmat, n = k)
    feat$feature <- with(feat, reorder(feature, -frequency)) # order features by descending frequency
    feat_plot <- ggplot(feat, aes(x = feature, y = frequency)) +
        geom_point() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        labs(x = NULL, y = "Relative Feature Frequency")
    return(feat_plot)
}
### Create plots for sampled and whole blog data.
blog_samp_freq<-freq(blog_samp_tok,20)
blog_freq<-freq(blog_tok,20)
grid.arrange(blog_samp_freq, blog_freq,nrow = 2,top = "3% vs. 100% of Blogging Corpus")
Some words are, as you would expect, far more common than others. Among the top 20 word frequencies in the blog set, there is hardly any difference in relative frequency between the 3% sample and the entire set. We use this observation to justify working with 3% of each dataset, at least for the initial exploration.
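To put a rough number on that visual impression, we can compare relative frequencies directly instead of only by eye. The sketch below is a quick side check rather than part of the modelling pipeline; it computes corpus-level proportions (slightly different from the per-document weighting inside freq()) and reports the largest gap among the shared top-20 words:
### Rough numeric check of how representative the 3% sample is.
library("quanteda")
top_prop <- function(toks, k = 20) {
    d <- dfm(toks)
    p <- sort(colSums(d) / sum(d), decreasing = TRUE) # relative frequency of each feature
    head(p, k)
}
samp_top <- top_prop(blog_samp_tok)
full_top <- top_prop(blog_tok)
shared <- intersect(names(samp_top), names(full_top))
max(abs(samp_top[shared] - full_top[shared])) # largest difference among the shared top words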
### Sample 3% of the Twitter and news lines.
twit_samp<-samp("./final/en_US/en_US.twitter.txt",0.03)
## Warning in readLines(file): line 167155 appears to contain an embedded nul
## Warning in readLines(file): line 268547 appears to contain an embedded nul
## Warning in readLines(file): line 1274086 appears to contain an embedded nul
## Warning in readLines(file): line 1759032 appears to contain an embedded nul
news_samp<-samp("./final/en_US/en_US.news.txt",0.03)
### Combine Samples.
text_samp<-c(twit_samp,news_samp,blog_samp)
text_tok<-tok(text_samp)
To conclude this initial exploration, let’s show the top 20 1-grams, 2-grams, and 3-grams for the whole sampled corpus.
### Create n-gram arrays.
text_2gram <- tokens_ngrams(text_tok,2)
text_3gram <- tokens_ngrams(text_tok,3)
### Plot summary for all texts.
text_freq<-freq(text_tok,20)
text_2_freq<-freq(text_2gram,20)
text_3_freq<-freq(text_3gram,20)
grid.arrange(text_freq, text_2_freq, text_3_freq, nrow = 1, ncol = 3, top = "Most Common 1-, 2-, and 3-grams in 3% of All Corpora")
In sum, we will use these sets of n-grams to create a simple ‘backoff’ model for word prediction. Given two words, we will predict a third by searching for a match in the list of all 3-grams and returning the word that completes the most common combination. If no match is found, the algorithm will “back off” to the 2-grams and look for a match on the last word only. To accomplish this, the n-gram data must be placed into a data frame format along with counts of all duplicates. I will conclude with an example of the function created to generate these tables, followed by a rough sketch of the backoff lookup itself:
count_2gram <- function(tokens_2grams) {
    suppressMessages(library(tidyverse))
    dat_2gram <- as.list(tokens_2grams) %>% unlist() %>% as.data.frame()                      # one 2-gram string per row
    dat_2gram <- separate(data = dat_2gram, col = ".", into = c("word1", "word2"), sep = "_") # split into the two words
    counted_dat_2gram <- count(dat_2gram, word1, word2, sort = TRUE)                          # tally duplicates, most common first
    return(counted_dat_2gram)
}
frame_2gram<-count_2gram(text_2gram)
## Warning: Expected 2 pieces. Additional pieces discarded in 490 rows [1655, 1786,
## 4062, 6049, 6050, 11362, 11363, 13500, 13753, 16467, 17566, 17567, 19309, 19408,
## 22171, 22402, 25463, 27677, 37692, 39076, ...].
head(frame_2gram)
## word1 word2 n
## 1 of the 12911
## 2 in the 12175
## 3 to the 6266
## 4 for the 5933
## 5 on the 5866
## 6 to be 4923
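As a preview of how these tables will fit together, here is a minimal sketch of the backoff lookup described above. It assumes a 3-gram table frame_3gram (columns word1, word2, word3, and n) built by a count_3gram() analogous to count_2gram(), and the name predict_next is only a placeholder for the eventual prediction function:
### Sketch of the backoff lookup (assumes a frame_3gram built analogously to frame_2gram).
predict_next <- function(w1, w2, frame_3gram, frame_2gram) {
    hit3 <- frame_3gram[frame_3gram$word1 == w1 & frame_3gram$word2 == w2, ]
    if (nrow(hit3) > 0) return(hit3$word3[1])      # tables are sorted by count, so row 1 is the most common completion
    hit2 <- frame_2gram[frame_2gram$word1 == w2, ] # back off: match on the last word only
    if (nrow(hit2) > 0) return(hit2$word2[1])
    NA_character_                                  # no match at either level
}
### For example: predict_next("one", "of", frame_3gram, frame_2gram)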