This report covers the exploratory analysis of the data for the capstone project: a predictive text app built with Shiny. The primary finding is that not all of the data will be needed to provide good coverage. The number of unique words required grows roughly exponentially with the share of word instances (the frequency coverage) they must account for.
library("readr")
library("ggplot2")
library("dplyr")
library("quanteda")
library("ngram")
I begin by reading the text files into R with readr’s read_lines(). Of the several functions I tried, this was the easiest to use and it read the files without errors.
blog <- read_lines("final/en_US/en_US.blogs.txt")
news <- read_lines("final/en_US/en_US.news.txt")
twitter <- read_lines("final/en_US/en_US.twitter.txt")
Below we can see the basic stats of these files.
stats <- data.frame(source=c("Blog","News","Twitter"),
file_size = c(file.size("final/en_US/en_US.blogs.txt")/1024^2,
file.size("final/en_US/en_US.news.txt")/1024^2,
file.size("final/en_US/en_US.twitter.txt")/1024^2),
word_count = sapply(list(blog, news, twitter), wordcount),
line_count = c(length(blog), length(news), length(twitter)),
char_count = c(sum(nchar(blog)), sum(nchar(news)), sum(nchar(twitter))))
names(stats) <- c("Source", "File Size (MB)", "Word Count", "Line Count", "Character Count")
stats
##    Source File Size (MB) Word Count Line Count Character Count
## 1    Blog       200.4242   37334131     899288       206824505
## 2    News       196.2775   34372530    1010242       203223159
## 3 Twitter       159.3641   30373543    2360148       162096031
I will then sample 10% of these data as a sufficiently large dataset to explore.
set.seed(123)
sample_blog <- blog[sample(length(blog), length(blog)*0.1)]
sample_news <- news[sample(length(news), length(news)*0.1)]
sample_twitter <- twitter[sample(length(twitter), length(twitter)*0.1)]
Using the quanteda package, I create a corpus for each dataset and attach a “Source” document variable to record where each text came from. The three corpora are then combined into a single corpus.
blog_corpus <- corpus(sample_blog)
docvars(blog_corpus, "Source") <- "blog"
news_corpus <- corpus(sample_news)
docvars(news_corpus, "Source") <- "news"
twitter_corpus <- corpus(sample_twitter)
docvars(twitter_corpus, "Source") <- "twitter"
combined_corpus <- blog_corpus + news_corpus + twitter_corpus
The intermediate objects are then removed, keeping only the combined corpus, to free up memory.
rm(stats, blog, news, twitter,
sample_blog, sample_news, sample_twitter,
blog_corpus, news_corpus, twitter_corpus)
To clean the data, I create a token set. I remove numbers, punctuation, and stopwords, then take the stem of each word (so “running” and “runs” both become “run” and are analyzed together). Finally, everything is converted to lowercase.
token <- tokens(combined_corpus, remove_numbers = TRUE, remove_punct = TRUE)
token <- tokens_select(token, stopwords("english"), selection = "remove")
token <- tokens_wordstem(token)
token <- tokens_tolower(token)
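To make the cleaning steps concrete, here is a rough base-R approximation applied to a single made-up sentence. It is only a sketch: the stopword list is a tiny hypothetical stand-in for stopwords("english"), stemming is left out because it needs a stemmer, and quanteda's tokenizer handles cases such as contractions more carefully than these regular expressions do.

```r
# Hypothetical mini stopword list for this sketch only;
# quanteda's stopwords("english") is much longer.
stop_words <- c("the", "is", "and", "a", "in")

# Made-up example sentence.
text <- "The 2 dogs are Running, and the cat is in a tree!"

clean <- tolower(text)                      # lowercase everything
clean <- gsub("[[:digit:]]+", " ", clean)   # drop numbers
clean <- gsub("[[:punct:]]+", " ", clean)   # drop punctuation

# Split on runs of whitespace to get word tokens.
toks <- strsplit(trimws(clean), "[[:space:]]+")[[1]]

# Remove stopwords.
toks <- toks[!toks %in% stop_words]
toks
```

In this sketch "are" survives only because it is missing from the shortened stopword list.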
Finally, I create two document-feature matrices (DFMs). A DFM has one row per text entry and one column per unique word, and each cell holds the number of times that word occurs in that text. The first is the basic DFM. The second is a DFM grouped by source (twitter/news/blog).
dfm <- dfm(token)
dfm_source <- dfm(token, groups = "Source")
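The structure of a document-feature matrix, and of its grouped version, can be illustrated with a toy example in base R. The documents and the source labels below are made up, and this is only a sketch of the idea, not how quanteda builds its sparse matrices internally.

```r
# Three made-up "documents", already tokenized-friendly.
docs <- c(d1 = "run fast run", d2 = "walk fast", d3 = "run walk walk")

# Split each document into word tokens on single spaces.
toks <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every unique word, in order of first appearance.
vocab <- unique(unlist(toks))

# Count each vocabulary word in each document (one column per document).
counts <- vapply(
  toks,
  function(x) as.integer(table(factor(x, levels = vocab))),
  integer(length(vocab))
)
rownames(counts) <- vocab

# Transpose so that rows are documents and columns are words.
toy_dfm <- t(counts)
toy_dfm

# Grouping by a (hypothetical) source label sums the rows within each group.
by_source <- rowsum(toy_dfm, group = c("blog", "blog", "news"))
by_source
```

Column sums of `toy_dfm` give the corpus-wide frequency of each word, which is the quantity the rest of this analysis works with.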
I will start with a high-level view of the data. The following code shows the overall structure of the text.
dfm
## Document-feature matrix of: 426,966 documents, 171,866 features (>99.99% sparse).
dfm_source
## Document-feature matrix of: 3 documents, 171,866 features (53.6% sparse).
dfm_source[,1:10]
## Document-feature matrix of: 3 documents, 10 features (13.3% sparse).
## 3 x 10 sparse Matrix of class "dfm"
## features
## docs bruschetta howev miss mark instead manag two-bit crostini huge
## blog 1 4 1523 996 616 1035 1011 3 1 644
## news 0 4 827 970 857 709 1735 1 4 390
## twitter 0 2 115 3040 347 477 457 0 0 539
Below, we can see the top features of these data: the most common words overall, and the same words broken down by source. A word cloud is also shown (the more common a word, the larger it appears).
topfeatures(dfm)
##   one  said  just  like   get    go  time   can   day  year
## 30791 30650 30496 30041 30034 26520 26125 24659 21947 21624
dfm_sort(dfm_source)[,1:10]
## Document-feature matrix of: 3 documents, 10 features (0.0% sparse).
## 3 x 10 sparse Matrix of class "dfm"
##          features
## docs        one  said  just  like   get    go  time   can   day  year
##   blog    13388  3656 10039 10972  9430  8137 10751  9789  7022  6660
##   news     8723 25202  5423  6000  6029  5522  6729  5942  4227 10725
##   twitter  8680  1792 15034 13069 14575 12861  8645  8928 10698  4239
textplot_wordcloud(dfm, color = rev(RColorBrewer::brewer.pal(5, "RdYlBu")))
There were several questions posed to consider with respect to this data. I will go over each of them below.
A bar graph of the top frequencies is shown below.
data.frame("names" = names(topfeatures(dfm, n=40)), "count" = unname(topfeatures(dfm, n=40))) %>%
ggplot(aes(x=reorder(names, -count), y=count)) + geom_col() +
labs(x = "", y = "Frequency", title = "Frequency of the Most Common Words") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
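An n-gram is a sequence of n consecutive words, so a 2-gram is two words back-to-back. The function below takes a value for n and plots a bar graph showing the frequencies of the 40 most common n-grams. Because it works on the cleaned token set, the n-grams are built from stemmed words with stopwords already removed.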
ngram <- function(n=1){
ngram_tok <- tokens_ngrams(token, n)
ngram_dfm <- dfm(ngram_tok)
ngram <- data.frame(names = colnames(ngram_dfm), frequency = colSums(ngram_dfm))
ngram$names <- as.character(ngram$names)
rownames(ngram) <- c()
ngram <- arrange(ngram, desc(frequency))[1:40,]
ggplot(ngram, aes(x=reorder(names, -frequency), y=frequency)) + geom_col() +
labs(x = "", y = "Frequency", title = paste0("Frequency of ",n,"-gram Sets")) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
}
Then we can plot this data below. Note that I already have a “1-gram” plot above, showing the frequency of the most common single words.
ngram(2)
ngram(3)
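As a small illustration of what n-gram generation does, the sketch below builds n-grams from a single made-up token vector in base R. `make_ngrams` is a hypothetical helper written for this example, not a quanteda function; it joins words with "_" to mirror the default concatenator of tokens_ngrams().

```r
# Hypothetical helper: paste each token together with its n - 1 successors.
make_ngrams <- function(tokens, n = 2) {
  # Too few tokens to form even one n-gram.
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
    character(1)
  )
}

# Made-up token vector.
toy_tokens <- c("thank", "you", "very", "much", "thank", "you")

bigrams <- make_ngrams(toy_tokens, 2)
bigrams

# Frequency table of the bigrams, most common first.
sort(table(bigrams), decreasing = TRUE)

trigrams <- make_ngrams(toy_tokens, 3)
trigrams
```

Note that this toy version treats the whole vector as one document, whereas the analysis above forms n-grams within each text entry.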
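Below is a function to calculate the number of unique words required to reach a given share of all word instances (the coverage), counting words from most to least frequent.
uniqueWords <- function(coverage) {
  # Word frequencies sorted from most to least frequent, so that the
  # count returned is the smallest number of words reaching the target
  frequency <- sort(colSums(dfm), decreasing = TRUE)
  total <- sum(frequency)
  x <- 0
  for(i in seq_along(frequency)) {
    # Running total of word instances covered by the first i words
    x <- x + frequency[i]
    if (x >= total*coverage){ return(i) }
  }
  return(i)
}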
fifty_percent <- uniqueWords(0.5)
fifty_percentage <- round(fifty_percent/ncol(dfm)*100,1)
ninety_percent <- uniqueWords(0.9)
ninety_percentage <- round(ninety_percent/ncol(dfm)*100,1)
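You need 1451 words (~0.8% of the unique words in the sample) to cover 50% of all word instances.
You need 17450 words (~10.2% of the unique words in the sample) to cover 90% of all word instances.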
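The coverage calculation can also be written without a loop by using a cumulative sum. The sketch below uses made-up frequencies, and `words_for_coverage` is a hypothetical helper name introduced only for this example.

```r
# Made-up word frequencies, deliberately unsorted.
freq <- c(the = 50, cat = 5, of = 30, zebra = 1, and = 14)

# Hypothetical helper: smallest number of words, taken from most to
# least frequent, whose combined frequency reaches the target coverage.
words_for_coverage <- function(freq, coverage) {
  sorted <- sort(freq, decreasing = TRUE)          # most frequent first
  covered <- cumsum(sorted)                        # running total of word instances
  unname(which(covered >= coverage * sum(sorted))[1])  # first rank reaching the target
}

words_for_coverage(freq, 0.5)
words_for_coverage(freq, 0.9)
```

With these toy numbers the single most frequent word already covers half of all instances, while 90% requires three of the five words, which is the same pattern seen in the real data on a much larger scale.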
Whether a word comes from a foreign language should not matter. If enough American English speakers type “Buenos dias”, then it effectively becomes part of American English speech and should be used in predicting text. A word should not be excluded from the analysis just because it originated in another language.
Also, modern languages all borrow words from foreign languages, and it is difficult, if not impossible, to sort through all of that.
However, we could use character encoding as a rough filter: if a word contains characters that are not normally used in en_US text, it is likely not part of American English.
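A minimal sketch of that character-based filter is shown below, assuming that "not normally used in en_US" is approximated by "outside basic ASCII". The word list is made up. This heuristic is crude: it would also flag curly quotes and emoji, and it would miss foreign words spelled with plain ASCII letters (such as "buenos").

```r
# Made-up word list; non-ASCII letters are written as Unicode escapes
# so the example does not depend on the file encoding.
words <- c("hello", "d\u00edas", "na\u00efve", "run", "\u00fcber")

# TRUE when any character in the word has a code point above 127,
# i.e. lies outside basic ASCII.
has_non_ascii <- vapply(
  words,
  function(w) any(utf8ToInt(w) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

# Words that would be flagged as candidate foreign words.
words[has_non_ascii]
```

In practice the allowed character set would need tuning, for example by permitting typographic apostrophes, before using this to filter the corpus.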
This is an extension of question 3, in which we look at the frequency-based coverage. We can extend that discussion by looking at a graph of that data:
coverage <- c(seq(0, 0.9, 0.1),0.95, 0.99, 1)
unique_words <- c(1:length(coverage))
for (i in 1:length(coverage)){
unique_words[i] <- uniqueWords(coverage[i])
}
ggplot(data.frame(coverage = coverage, unique_words = unique_words), aes(x=coverage, y=unique_words)) +
geom_line() +
labs(x = "Coverage (Proportion of Word Instances)", y = "Number of Unique Words Required", title = "Number of Unique Words vs. Coverage")
There is clearly an exponential increase in the number of unique words required as coverage approaches 100%. I will need to determine how many words to keep so that the app stays small while still covering enough of the language, whether that’s 90%, 95% or some other value.
The main takeaway from this analysis is that I need to be selective about how much data I use in the final app. The data can grow to several gigabytes in size, which would not be acceptable for the app. A smaller subset of the data should be sufficient for a relatively large amount of coverage.
The next step will involve creating a predictive model, and then creating a Shiny app to deploy the model. This will build on the results from this exploratory analysis (e.g., keeping only the words needed for 90% frequency coverage, which gives a much smaller dataset).