This project is the Capstone project from the Johns Hopkins University Data Science Specialization on Coursera, offered in conjunction with SwiftKey. The purpose of this project is an exercise in Natural Language Processing: the design of a predictive text application of the kind that could run on a cellphone, much like the SwiftKey app. The data were acquired from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, compliments of Coursera, Johns Hopkins University, and SwiftKey.
The data were read directly from the zip file using the readLines() function, grouped together as a list, and then analyzed for the number of lines, the total character count (labeled WordCount in the table below), the object size, and the maximum line length of each file. As expected, the maximum line length in the twitter file is 140 characters, which demonstrated that the code was working properly.
Source | LineCount | WordCount | Size | MaxLength |
---|---|---|---|---|
blog | 899288 | 206825253 | 255.4 Mb | 40833 |
news | 1010242 | 203223159 | 257.3 Mb | 11384 |
twitter | 2360148 | 162758870 | 318.8 Mb | 140 |
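As a quick spot-check of the 140-character maximum, assuming the twitter vector has been read in as in the Data Acquisition code at the end of this report:
max(nchar(twitter))  # longest line in en_US.twitter.txt; matches the 140 reported above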
Due to the size of the data sets shown in the previous slide, the data were sampled to accommodate the processing power of the system used. I sampled 0.5% of each data set prior to tokenization. The tokenization step is where most of the “junk” was filtered from the data. To prevent words from being counted twice based on capitalization, all words were converted to lower case. Numbers, symbols, punctuation, URLs, hyphens, and stopwords were filtered from the tokens. Each corpus was analyzed separately before the corpora were combined and analyzed as a whole. The top 10 words and their frequencies from each are shown below.
The top ten words and their frequencies from the blog corpus are:
## one like can time just get make go day know
## 704 573 537 507 493 480 411 410 397 363
The top ten words and their frequencies from the news corpus are:
## said year one time new first state two say like
## 1280 624 463 378 343 322 322 313 303 299
The top ten words and their frequencies from the Twitter corpus are:
## get just go thank like love day good can rt
## 731 721 691 653 607 605 545 514 486 461
The top ten words with frequencies from the combined corpora are:
## one said get like just go time can year day
## 1611 1575 1509 1479 1469 1387 1312 1312 1223 1169
To visualize the data, the document-feature matrix from each analysis was plotted as a word cloud, which clearly represents word frequency by size and color. Each word cloud shows the top hundred words from the respective corpus.
To visualize the top 500 words by rank and frequency of occurrence, the data from the document-feature matrix were plotted as a curve, with a rug on the left side to show the distribution of frequencies. It is clear that as word rank increases (i.e., words become less frequent), there are exponentially fewer occurrences.
To further investigate common word groupings, I used an n-gram analysis for 2-word and 3-word associations.
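For readers unfamiliar with the underscore-joined n-grams shown in the tables below, here is a minimal, self-contained illustration of how quanteda's tokens_ngrams() forms them (the example sentence is arbitrary, not drawn from the corpora):
library(quanteda)
toks <- tokens("thanks for the follow")   # tokenize a toy sentence
tokens_ngrams(toks, n = 2)                # yields "thanks_for" "for_the" "the_follow"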
The top 10 word pairs and trios are shown in the tables below:

2-Gram | Frequency |
---|---|
right_now | 117 |
new_york | 115 |
year_old | 93 |
last_year | 82 |
last_night | 75 |
feel_like | 73 |
years_ago | 72 |
last_week | 66 |
looking_forward | 64 |
can_get | 61 |

3-Gram | Frequency |
---|---|
president_barack_obama | 16 |
new_york_city | 15 |
happy_new_year | 14 |
let_us_know | 13 |
world_war_ii | 12 |
happy_mother’s_day | 10 |
read_poem_read | 10 |
poem_read_poem | 10 |
tick_tick_tick | 10 |
happy_mothers_day | 8 |
As with the single-word analysis, the n-gram analyses were plotted as word clouds and also graphed as the log of frequency versus rank of occurrence. One limitation of the n-gram word cloud seems to be the length of the labels: when a word combination is too long, for example “president_barack_obama”, the n-gram is excluded from the word cloud.
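One possible mitigation, assuming the exclusion occurs because the longest labels cannot be scaled to fit the plotting region, would be to cap the maximum label size; the call below is only a sketch using the ngram3.DFM object built in the appendix code and was not applied in the figures above.
textplot_wordcloud(ngram3.DFM, max_words = 20, max_size = 2, rotation = 0.2, color = RColorBrewer::brewer.pal(8, "Dark2"))  # smaller max_size so long n-grams are more likely to fit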
When analyzing the corpora, it is important to recognize the proportion of unique words and their frequency of use in order to assess the scope of the full library. Here I looked at how many unique words make up 50% and 90% of the library established by the sampled corpora from the combination of blog, news, and twitter feeds.
600 unique words make up about 50 percent of the language.
25816 unique words make up about 90 percent of the language.
Some interesting findings came up in the analyses. Prior to the log transform of the frequencies, there was severe overplotting at the lower frequencies, making it difficult to interpret the distribution of the higher-ranking words. Once the frequencies were plotted on a log scale with the rug on the left, it became clear that there was a large cluster of words in the three separate corpora with ranks between 200 and 400 that had much higher frequencies than the words ranked above 500. When the three corpora were combined, this pattern was still present, but over a larger sample of words.
In the n-gram analyses, the 2-gram and 3-gram results showed a very pronounced sampling effect, which appeared as clear tiers on the frequency graph. While not ideal, these still gave good insight into the most frequent word pairs and trios, even if not conclusive. As mentioned previously, not all of the highest-frequency groupings were plotted in the word cloud, but what is interesting about those that were is the presence of some strange groupings like “thai_restaurant_wilmington”, which likely reflects the time period when the raw data were collected and the low frequency at which many 3-gram trios appear in the corpora.
While it would be ideal to run the 2-gram and 3-gram analyses on the entire corpora, the processing power and time required to do so are prohibitive for the proposed application of this project.
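One possible way to extend the n-gram counts beyond a 0.5% sample without holding an entire corpus in memory would be to read and tokenize each file in chunks and accumulate the counts; the sketch below was not run for this report, and the 10,000-line chunk size is an arbitrary choice.
con <- unz("./Coursera-SwiftKey.zip", "final/en_US/en_US.blogs.txt")
open(con)                                   # open the connection for incremental reads
repeat {
  chunk <- readLines(con, n = 10000, encoding = 'UTF-8', skipNul = TRUE)
  if (length(chunk) == 0) break             # stop once the file is exhausted
  ## tokenize `chunk` here and add its n-gram counts to a running total
}
close(con)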
Setup
knitr::opts_chunk$set(echo = TRUE)
library(readtext)
library(quanteda)
library(RColorBrewer)
library(wordcloud)
library(ggplot2)
library(magrittr)    # provides the %>% pipe used in the frequency and n-gram plots
library(kableExtra)
set.seed(1117)
Data Acquisition
## This opens a connection to each raw data file inside the zip archive, reads the lines, and then closes the connection.
options(warn = -1)
con <- unz("./Coursera-SwiftKey.zip","final/en_US/en_US.blogs.txt")
blog <- readLines(con, encoding = 'UTF-8', skipNul = T)
close(con)
con <- unz("./Coursera-SwiftKey.zip","final/en_US/en_US.news.txt")
news <- readLines(con, encoding = 'UTF-8', skipNul = T)
close(con)
con <- unz("./Coursera-SwiftKey.zip","final/en_US/en_US.twitter.txt")
twitter <- readLines(con, encoding = 'UTF-8', skipNul = T)
close(con)
## At this point, the three character vectors are gathered into a list and converted to lower case. An empty data frame is then set up to hold the file summary statistics.
corpus.list <- list(blog=blog, news=news, twitter=twitter)
corpus.list <- sapply(corpus.list, tolower) # converts all text to lower case
counts.df <- data.frame(Source = c('blog', 'news', 'twitter'), LineCount = NA, WordCount = NA, Size = NA, MaxLength = NA)
## File aspect analysis is returned to the data frame 'counts.df'
counts.df$LineCount <- sapply(corpus.list, function(x){length(x)}) # Counts the number of lines in each part of the list
counts.df$WordCount <- sapply(corpus.list, function(x){sum(nchar(x))}) # sums the characters in each line; this character count is what the table reports as WordCount
counts.df$Size <- sapply(corpus.list, function(x){format(object.size(x),"MB")}) # returns the size of the file
counts.df$MaxLength <- sapply(corpus.list, function(x){max(unlist(lapply(x, function(y) nchar(y))))}) # returns the maximum line length
## The data frame 'counts.df' is presented as a kable so it looks nice
knitr::kable(counts.df, caption = "Corpora Features")
Cleaning the Data
## Cleaning the Blog Sample - first, 0.5% of the full blog corpus is sampled. Next the sample is tokenized, removing numbers, punctuation, symbols, hyphens, and URLs; stopwords are then removed from the tokens.
blog.sample <- sample(corpus.list$blog,length(corpus.list$blog) * 0.005)
blog.tokens <- tokens(blog.sample, remove_numbers = T, remove_punct = T, remove_symbols = T, remove_hyphens = T, remove_url = T)
blog.stopwords <- tokens_remove(blog.tokens, stopwords("english"))
## The cleaned data are converted into a document-feature matrix (DFM), and then the 10 most frequently occurring tokens are printed. The same process is applied to the news, twitter, and combined corpora.
blog.DFM <- dfm(blog.stopwords, stem = T)
blogFeat <- topfeatures(blog.DFM, 10)
print(blogFeat)
## Cleaning the News Sample
news.sample <- sample(corpus.list$news, length(corpus.list$news) * 0.005)
news.tokens <- tokens(news.sample, remove_numbers = T, remove_punct = T, remove_symbols = T, remove_hyphens = T, remove_url = T)
news.stopwords <- tokens_remove(news.tokens, stopwords("english"))
news.DFM <- dfm(news.stopwords, stem = T)
newsFeat <- topfeatures(news.DFM, 10)
print(newsFeat)
## Cleaning the twitter corpora
twitter.sample <- sample(corpus.list$twitter,length(corpus.list$twitter) * 0.005)
twitter.tokens <- tokens(twitter.sample, remove_numbers = T, remove_punct = T, remove_symbols = T, remove_hyphens = T, remove_twitter = T, remove_url = T)
twitter.stopwords <- tokens_remove(twitter.tokens, stopwords("english"))
twitter.DFM <- dfm(twitter.stopwords, stem = T)
twitterFeat <- topfeatures(twitter.DFM, 10)
print(twitterFeat)
## Cleaning the Combined Corpora - the three samples (blog, news, and twitter) were concatenated and then the same process was applied as with the individual corpora.
combined.sample <- c(blog.sample, news.sample, twitter.sample)
combined.tokens <- tokens(combined.sample, remove_numbers = T, remove_punct = T, remove_symbols = T, remove_hyphens = T, remove_url = T)
combined.stopwords <- tokens_remove(combined.tokens, stopwords("english"))
combined.DFM <- dfm(combined.stopwords, stem = T)
combinedFeat <- topfeatures(combined.DFM, 10)
print(combinedFeat)
Visualization of the Data - Wordclouds
library(quanteda)
# Wordcloud for blog corpus
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Blog Corpus Word Cloud")
textplot_wordcloud(blog.DFM, random_order = F, rotation = .3, max_words = 100, color = RColorBrewer::brewer.pal(8, "Dark2"))
# Wordcloud for news corpus
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "News Corpus Word Cloud")
textplot_wordcloud(news.DFM, random_order = F, rotation = .3, max_words = 100, color = RColorBrewer::brewer.pal(8, "Dark2"))
# Wordcloud for the twitter corpus
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Twitter Corpus Word Cloud")
textplot_wordcloud(twitter.DFM, random_order = F, rotation = .3, max_words = 100, color = RColorBrewer::brewer.pal(8, "Dark2"))
# Wordcloud for combined corpora
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Combined Corpora Word Cloud")
textplot_wordcloud(combined.DFM, random_order = F, rotation = .3, max_words = 100, color = RColorBrewer::brewer.pal(8, "Dark2"))
Visualization of the Data - Frequency Plots
library(ggplot2)
## The plots use a log transform on the y axis, and a rug along the left side to show the distribution of frequencies more clearly than individual points on a scatter plot would.
bplot <- textstat_frequency(blog.DFM, n = 500) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Rank", y = "Log of Frequency") + ggtitle("Frequency of Words in Blog Corpora")
nplot <- textstat_frequency(news.DFM, n = 500) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Rank", y = "Log of Frequency") + ggtitle("Frequency of Words in News Corpora")
tplot <- textstat_frequency(twitter.DFM, n = 500) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Rank", y = "Log of Frequency") + ggtitle("Frequency of Words in Twitter Corpora")
## The three individual corpora were plotted with 500 points each; to see the same relationship in the combined sample, this was extended to 1500 points to accommodate the concatenation of the three groups of 500.
cplot <- textstat_frequency(combined.DFM, n = 1500) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Rank", y = "Log of Frequency") + ggtitle("Frequency of Words in Combined Corpora")
print(bplot)
print(nplot)
print(tplot)
print(cplot)
N-Gram Analysis
## 2-Gram Analysis using tokens_ngrams
ngram2 <- tokens_ngrams(combined.stopwords, 2)
ngram2.DFM <- dfm(ngram2)
ngram2Feat <- topfeatures(ngram2.DFM, 10)
## 3-Gram Analysis
ngram3 <- tokens_ngrams(combined.stopwords, 3)
ngram3.DFM <- dfm(ngram3)
ngram3Feat <- topfeatures(ngram3.DFM, 10)
## Putting the n-gram analysis results in kables to make the results look nice
knitr::kable(ngram2Feat, caption = "2-Gram Frequencies", full_width=F, col.names=NULL)
knitr::kable(ngram3Feat, caption = "3-Gram Frequencies", full_width=F, col.names=NULL)
Visualizing the N-Gram analysis - word cloud
# 2-Gram Word Cloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "2-Gram Corpora Word Cloud")
textplot_wordcloud(ngram2.DFM, random_order = T, rotation = .3, max_words = 50, color = RColorBrewer::brewer.pal(8, "Dark2"))
# 3-Gram Word Cloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "3-Gram Corpora Word Cloud")
textplot_wordcloud(ngram3.DFM, random_order = T, rotation = 0.2, max_words = 20, color = RColorBrewer::brewer.pal(8, "Dark2"))
Visualizing the N-Gram analysis - line-plot
library(ggplot2)
## Log-transformed frequency vs. rank plots of the 2-gram and 3-gram data
n2plot <- textstat_frequency(ngram2.DFM, n = 500) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Pair Rank", y = "Log of Frequency") + ggtitle("Word Pairs in Combined Corpora")
n3plot <- textstat_frequency(ngram3.DFM, n = 250) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Trio Rank", y = "Log of Frequency") + ggtitle("Word Trios in Combined Corpora")
print(n2plot)
print(n3plot)
Corpora Library Analysis
combo <- textstat_frequency(combined.DFM)
## Walks the frequency table in rank order, accumulating frequencies row by row until the running total reaches 50% of all tokens in the combined corpora; the number of rows (unique words) needed is reported. The same process is applied for the 90% threshold.
sum50 <- 0
for(i in 1:nrow(combo)) {
  sum50 <- sum50 + combo$frequency[i]
  if(sum50 >= 0.5*sum(combo$frequency)){break}
}
s50 <- sprintf("%d unique words make up about 50 percent of the language", i)
sum90 <- 0
for(i in 1:nrow(combo)) {
  sum90 <- sum90 + combo$frequency[i]
  if(sum90 >= 0.9*sum(combo$frequency)){break}
}
s90 <- sprintf("%d unique words make up about 90 percent of the language", i)
## Results are printed nicely in a kable.
knitr::kable(c(s50, s90), caption="Unique Word Composition of Corpora Library", col.names=NULL)
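As a cross-check, the same coverage counts can be obtained from the frequency table without a loop; the vectorized version below is only a sketch and reuses the combo object defined above.
coverage <- cumsum(combo$frequency) / sum(combo$frequency)   # cumulative share of all tokens, by word rank
words50 <- which(coverage >= 0.5)[1]                         # number of unique words covering ~50% of tokens
words90 <- which(coverage >= 0.9)[1]                         # number of unique words covering ~90% of tokens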