This project is the Capstone project from the Johns Hopkins University Data Science Specialization on Coursera, offered in conjunction with SwiftKey. The purpose of this project is an exercise in Natural Language Processing: the design of a predictive text application of the kind that could run on a cellphone, much like the SwiftKey app. The data were acquired from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, compliments of Coursera, Johns Hopkins University, and SwiftKey.
The data were read directly from the zip file using the readLines() function, grouped together as a list, and then analyzed for the number of lines, the total character count (labeled WordCount in the table below), the object size, and the maximum line length of each file. As expected, the maximum line length in the twitter file is 140 characters, which demonstrated that the code was working properly.
Source | LineCount | WordCount | Size | MaxLength |
---|---|---|---|---|
blog | 899288 | 206825253 | 255.4 Mb | 40833 |
news | 1010242 | 203223159 | 257.3 Mb | 11384 |
twitter | 2360148 | 162758870 | 318.8 Mb | 140 |
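As a quick spot-check of the 140-character maximum, assuming the twitter vector has been read in as in the Data Acquisition code at the end of this report:
max(nchar(twitter))  # longest line in en_US.twitter.txt; matches the 140 reported above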
Due to the size of the data sets shown in the previous slide, the data were sampled to accommodate the processing power of the system used. I sampled 0.5% of each data set prior to tokenization. The tokenization step is where most of the “junk” was filtered from the data. To prevent words from being counted twice based on capitalization, all words were converted to lower case. Numbers, symbols, punctuation, URLs, hyphens, and stopwords were filtered from the tokens. Each corpus was analyzed separately before the corpora were combined and analyzed as a whole. The top 10 words and their frequencies from each are shown below.
The top ten words and their frequencies from the blog corpus are:
## one like can time just get make go day know
## 704 573 537 507 493 480 411 410 397 363
The top ten words and their frequencies from the news corpus are:
## said year one time new first state two say like
## 1280 624 463 378 343 322 322 313 303 299
The top ten words and their frequencies from the Twitter corpus are:
## get just go thank like love day good can rt
## 731 721 691 653 607 605 545 514 486 461
The top ten words with frequencies from the combined corpora are:
## one said get like just go time can year day
## 1611 1575 1509 1479 1469 1387 1312 1312 1223 1169
To visualize the data, the document-feature matrix from each analysis was plotted as a word cloud, which clearly represents word frequency by size and color. Each word cloud shows the top hundred words from the respective corpus.
To visualize the top 500 words by rank and frequency of occurrence, the data from the document-feature matrix were plotted as a curve, with a rug on the left side to show the distribution of frequencies. It is clear that as word rank increases (i.e., words become less frequent), there are exponentially fewer occurrences.
To further investigate common word groupings, I used an n-gram analysis for 2-word and 3-word associations.
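For readers unfamiliar with the underscore-joined n-grams shown in the tables below, here is a minimal, self-contained illustration of how quanteda's tokens_ngrams() forms them (the example sentence is arbitrary, not drawn from the corpora):
library(quanteda)
toks <- tokens("thanks for the follow")   # tokenize a toy sentence
tokens_ngrams(toks, n = 2)                # yields "thanks_for" "for_the" "the_follow"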
The top 10 word pairs and trios are shown in the tables below:

2-Gram | Frequency |
---|---|
right_now | 117 |
new_york | 115 |
year_old | 93 |
last_year | 82 |
last_night | 75 |
feel_like | 73 |
years_ago | 72 |
last_week | 66 |
looking_forward | 64 |
can_get | 61 |

3-Gram | Frequency |
---|---|
president_barack_obama | 16 |
new_york_city | 15 |
happy_new_year | 14 |
let_us_know | 13 |
world_war_ii | 12 |
happy_mother’s_day | 10 |
read_poem_read | 10 |
poem_read_poem | 10 |
tick_tick_tick | 10 |
happy_mothers_day | 8 |
As with the single-word analysis, the n-gram analyses were plotted as word clouds and also graphed as the log of frequency versus rank of occurrence. One limitation of the n-gram word cloud seems to be the length of the labels: when a word combination is too long, for example “president_barack_obama”, the n-gram is excluded from the word cloud.
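One possible mitigation, assuming the exclusion occurs because the longest labels cannot be scaled to fit the plotting region, would be to cap the maximum label size; the call below is only a sketch using the ngram3.DFM object built in the appendix code and was not applied in the figures above.
textplot_wordcloud(ngram3.DFM, max_words = 20, max_size = 2, rotation = 0.2, color = RColorBrewer::brewer.pal(8, "Dark2"))  # smaller max_size so long n-grams are more likely to fit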
When analyzing the corpora, it is important to recognize the proportion of unique words and their frequency of use in order to assess the scope of the full library. Here I looked at how many unique words make up 50% and 90% of the library established by the sampled corpora from the combination of blog, news, and twitter feeds.
600 unique words make up about 50 percent of the language.
25816 unique words make up about 90 percent of the language.
Some interesting findings came up in the analyses. Prior to the log transform of the frequencies, there was severe overplotting at the lower frequencies, making it difficult to interpret the distribution of the higher-ranking words. Once the frequencies were plotted on a log scale with the rug on the left, it became clear that there was a large cluster of words in the three separate corpora with ranks between 200 and 400 that had much higher frequencies than the words ranked above 500. When the three corpora were combined, this pattern was still present, but over a larger sample of words.
In the n-gram analyses, the 2-gram and 3-gram results showed a very pronounced sampling effect, which appeared as clear tiers on the frequency graph. While not ideal, these still gave good insight into the most frequent word pairs and trios, even if not conclusive. As mentioned previously, not all of the highest-frequency groupings were plotted in the word cloud, but what is interesting about those that were is the presence of some strange groupings like “thai_restaurant_wilmington”, which likely reflects the time period when the raw data were collected and the low frequency at which many 3-gram trios appear in the corpora.
While it would be ideal to run the 2-gram and 3-gram analyses on the entire corpora, the processing power and time required to do so are prohibitive for the proposed application of this project.
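One possible way to extend the n-gram counts beyond a 0.5% sample without holding an entire corpus in memory would be to read and tokenize each file in chunks and accumulate the counts; the sketch below was not run for this report, and the 10,000-line chunk size is an arbitrary choice.
con <- unz("./Coursera-SwiftKey.zip", "final/en_US/en_US.blogs.txt")
open(con)                                   # open the connection for incremental reads
repeat {
  chunk <- readLines(con, n = 10000, encoding = 'UTF-8', skipNul = TRUE)
  if (length(chunk) == 0) break             # stop once the file is exhausted
  ## tokenize `chunk` here and add its n-gram counts to a running total
}
close(con)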
Setup
knitr::opts_chunk$set(echo = TRUE)
library(readtext)
library(quanteda)
library(RColorBrewer)
library(wordcloud)
library(ggplot2)
library(magrittr)    # provides the %>% pipe used in the frequency and n-gram plots
library(kableExtra)
set.seed(1117)
Data Acquisition
## This opens a connection to each raw data file inside the zip archive, reads the lines, and then closes the connection.
options(warn = -1)
con <- unz("./Coursera-SwiftKey.zip","final/en_US/en_US.blogs.txt")
blog <- readLines(con, encoding = 'UTF-8', skipNul = T)
close(con)
con <- unz("./Coursera-SwiftKey.zip","final/en_US/en_US.news.txt")
news <- readLines(con, encoding = 'UTF-8', skipNul = T)
close(con)
con <- unz("./Coursera-SwiftKey.zip","final/en_US/en_US.twitter.txt")
twitter <- readLines(con, encoding = 'UTF-8', skipNul = T)
close(con)
## At this point, the three character vectors are gathered into a list and converted to lower case. An empty data frame is then set up to hold the file summary statistics.
corpus.list <- list(blog=blog, news=news, twitter=twitter)
corpus.list <- sapply(corpus.list, tolower) # converts all text to lower case
counts.df <- data.frame(Source = c('blog', 'news', 'twitter'), LineCount = NA, WordCount = NA, Size = NA, MaxLength = NA)
## File aspect analysis is returned to the data frame 'counts.df'
counts.df$LineCount <- sapply(corpus.list, function(x){length(x)}) # Counts the number of lines in each part of the list
counts.df$WordCount <- sapply(corpus.list, function(x){sum(nchar(x))}) # sums the characters in each line; this character count is what the table reports as WordCount
counts.df$Size <- sapply(corpus.list, function(x){format(object.size(x),"MB")}) # returns the size of the file
counts.df$MaxLength <- sapply(corpus.list, function(x){max(unlist(lapply(x, function(y) nchar(y))))}) # returns the maximum line length
## The data frame 'counts.df' is presented as a kable so it looks nice
knitr::kable(counts.df, caption = "Corpora Features")
Cleaning the Data
## Cleaning the Blog Sample - first, 0.5% of the full blog corpus is sampled. Next the sample is tokenized, removing numbers, punctuation, symbols, hyphens, and URLs; stopwords are then removed from the tokens.
blog.sample <- sample(corpus.list$blog,length(corpus.list$blog) * 0.005)
blog.tokens <- tokens(blog.sample, remove_numbers = T, remove_punct = T, remove_symbols = T, remove_hyphens = T, remove_url = T)
blog.stopwords <- tokens_remove(blog.tokens, stopwords("english"))
## The cleaned data are converted into a document-feature matrix (DFM), and then the 10 most frequently occurring tokens are printed. The same process is applied to the news, twitter, and combined corpora.
blog.DFM <- dfm(blog.stopwords, stem = T)
blogFeat <- topfeatures(blog.DFM, 10)
print(blogFeat)
## Cleaning the News Sample
news.sample <- sample(corpus.list$news, length(corpus.list$news) * 0.005)
news.tokens <- tokens(news.sample, remove_numbers = T, remove_punct = T, remove_symbols = T, remove_hyphens = T, remove_url = T)
news.stopwords <- tokens_remove(news.tokens, stopwords("english"))
news.DFM <- dfm(news.stopwords, stem = T)
newsFeat <- topfeatures(news.DFM, 10)
print(newsFeat)
## Cleaning the twitter corpora
twitter.sample <- sample(corpus.list$twitter,length(corpus.list$twitter) * 0.005)
twitter.tokens <- tokens(twitter.sample, remove_numbers = T, remove_punct = T, remove_symbols = T, remove_hyphens = T, remove_twitter = T, remove_url = T)
twitter.stopwords <- tokens_remove(twitter.tokens, stopwords("english"))
twitter.DFM <- dfm(twitter.stopwords, stem = T)
twitterFeat <- topfeatures(twitter.DFM, 10)
print(twitterFeat)
## Cleaning the Combined Corpora - the three samples (blog, news, and twitter) were concatenated and then the same process was applied as with the individual corpora.
combined.sample <- c(blog.sample, news.sample, twitter.sample)
combined.tokens <- tokens(combined.sample, remove_numbers = T, remove_punct = T, remove_symbols = T, remove_hyphens = T, remove_url = T)
combined.stopwords <- tokens_remove(combined.tokens, stopwords("english"))
combined.DFM <- dfm(combined.stopwords, stem = T)
combinedFeat <- topfeatures(combined.DFM, 10)
print(combinedFeat)
Visualization of the Data - Wordclouds
library(quanteda)
# Wordcloud for blog corpus
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Blog Corpus Word Cloud")
textplot_wordcloud(blog.DFM, random_order = F, rotation = .3, max_words = 100, color = RColorBrewer::brewer.pal(8, "Dark2"))
# Wordcloud for news corpus
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "News Corpus Word Cloud")
textplot_wordcloud(news.DFM, random_order = F, rotation = .3, max_words = 100, color = RColorBrewer::brewer.pal(8, "Dark2"))
# Wordcloud for the twitter corpus
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Twitter Corpus Word Cloud")
textplot_wordcloud(twitter.DFM, random_order = F, rotation = .3, max_words = 100, color = RColorBrewer::brewer.pal(8, "Dark2"))
# Wordcloud for combined corpora
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Combined Corpora Word Cloud")
textplot_wordcloud(combined.DFM, random_order = F, rotation = .3, max_words = 100, color = RColorBrewer::brewer.pal(8, "Dark2"))
Visualization of the Data - Frequency Plots
library(ggplot2)
## The plots use a log transform on the y axis, and a rug along the left side to show the distribution of frequencies more clearly than individual points on a scatter plot would.
bplot <- textstat_frequency(blog.DFM, n = 500) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Rank", y = "Log of Frequency") + ggtitle("Frequency of Words in Blog Corpora")
nplot <- textstat_frequency(news.DFM, n = 500) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Rank", y = "Log of Frequency") + ggtitle("Frequency of Words in News Corpora")
tplot <- textstat_frequency(twitter.DFM, n = 500) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Rank", y = "Log of Frequency") + ggtitle("Frequency of Words in Twitter Corpora")
## The three individual corpora were plotted with 500 points each; to see the same relationship in the combined sample, this was extended to 1500 points to accommodate the concatenation of the three groups of 500.
cplot <- textstat_frequency(combined.DFM, n = 1500) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Rank", y = "Log of Frequency") + ggtitle("Frequency of Words in Combined Corpora")
print(bplot)
print(nplot)
print(tplot)
print(cplot)
N-Gram Analysis
## 2-Gram Analysis using tokens_ngrams
ngram2 <- tokens_ngrams(combined.stopwords, 2)
ngram2.DFM <- dfm(ngram2)
ngram2Feat <- topfeatures(ngram2.DFM, 10)
## 3-Gram Analysis
ngram3 <- tokens_ngrams(combined.stopwords, 3)
ngram3.DFM <- dfm(ngram3)
ngram3Feat <- topfeatures(ngram3.DFM, 10)
## Putting the n-gram analysis results in kables to make the results look nice
knitr::kable(ngram2Feat, caption = "2-Gram Frequencies", full_width=F, col.names=NULL)
knitr::kable(ngram3Feat, caption = "3-Gram Frequencies", full_width=F, col.names=NULL)
Visualizing the N-Gram analysis - word cloud
# 2-Gram Word Cloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "2-Gram Corpora Word Cloud")
textplot_wordcloud(ngram2.DFM, random_order = T, rotation = .3, max_words = 50, color = RColorBrewer::brewer.pal(8, "Dark2"))
# 3-Gram Word Cloud
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "3-Gram Corpora Word Cloud")
textplot_wordcloud(ngram3.DFM, random_order = T, rotation = 0.2, max_words = 20, color = RColorBrewer::brewer.pal(8, "Dark2"))
Visualizing the N-Gram analysis - line-plot
library(ggplot2)
## Log-transformed frequency vs. rank plots of the 2-gram and 3-gram data
n2plot <- textstat_frequency(ngram2.DFM, n = 500) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Pair Rank", y = "Log of Frequency") + ggtitle("Word Pairs in Combined Corpora")
n3plot <- textstat_frequency(ngram3.DFM, n = 250) %>% ggplot(aes(x = rank, y = log10(frequency))) + geom_line() + geom_rug(sides = "l") + labs(x = "Word Trio Rank", y = "Log of Frequency") + ggtitle("Word Trios in Combined Corpora")
print(n2plot)
print(n3plot)
Corpora Library Analysis
combo <- textstat_frequency(combined.DFM)
## Walks the frequency table in rank order, accumulating frequencies row by row until the running total reaches 50% of all tokens in the combined corpora; the number of rows (unique words) needed is reported. The same process is applied for the 90% threshold.
sum50 <- 0
for(i in 1:nrow(combo)) {
  sum50 <- sum50 + combo$frequency[i]
  if(sum50 >= 0.5*sum(combo$frequency)){break}
}
s50 <- sprintf("%d unique words make up about 50 percent of the language", i)
sum90 <- 0
for(i in 1:nrow(combo)) {
  sum90 <- sum90 + combo$frequency[i]
  if(sum90 >= 0.9*sum(combo$frequency)){break}
}
s90 <- sprintf("%d unique words make up about 90 percent of the language", i)
## Results are printed nicely in a kable.
knitr::kable(c(s50, s90), caption="Unique Word Composition of Corpora Library", col.names=NULL)
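As a cross-check, the same coverage counts can be obtained from the frequency table without a loop; the vectorized version below is only a sketch and reuses the combo object defined above.
coverage <- cumsum(combo$frequency) / sum(combo$frequency)   # cumulative share of all tokens, by word rank
words50 <- which(coverage >= 0.5)[1]                         # number of unique words covering ~50% of tokens
words90 <- which(coverage >= 0.9)[1]                         # number of unique words covering ~90% of tokens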