The following report covers an initial assessment of the three text files provided: blogs, news, and Twitter. The overall project calls for building a text-prediction application based on a sample of this data. This initial report, however, covers basic summary statistics for each file, exploratory plots of words per line and word frequencies, and an initial n-gram analysis of a cleaned sample of the data.
The function “readLines” is used to read in each of the large files; each file is then assessed for its number of lines, characters, and words, summarised in the table below.
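For reference, a minimal sketch of the loading step is shown first. The file paths are assumptions and should be adjusted to wherever the dataset was unzipped; skipNul = TRUE is a precaution against embedded nul characters.
# Assumed file locations - adjust to wherever the dataset was unzipped
blogsFileName   <- "final/en_US/en_US.blogs.txt"
newsFileName    <- "final/en_US/en_US.news.txt"
twitterFileName <- "final/en_US/en_US.twitter.txt"
# Read each file line by line
blogs   <- readLines(blogsFileName,   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(newsFileName,    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitterFileName, encoding = "UTF-8", skipNul = TRUE)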
library(stringi)  # stri_stats_latex, stri_count_words
library(knitr)    # kable
# Number of lines per file
numLines <- sapply(list(blogs, news, twitter), length)
# Number of characters per file
numChars <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)
# Number of words per file (row 4 of the stri_stats_latex output is the word count)
numWords <- sapply(list(blogs, news, twitter), stri_stats_latex)[4, ]
# File size in MB
fileSizeMB <- round(file.info(c(blogsFileName, newsFileName, twitterFileName))$size / 1024 ^ 2)
# Min / mean / max words per line for each file
text_summary <- sapply(list(blogs, news, twitter),
                       function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(text_summary) <- c('Min', 'Mean', 'Max')
summary <- data.frame(
  File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
  FileSize = paste(fileSizeMB, "MB"),
  Lines = numLines,
  Characters = numChars,
  Words = numWords
)
kable(summary, row.names = FALSE, format = "simple")
| File | FileSize | Lines | Characters | Words |
|---|---|---|---|---|
| en_US.blogs.txt | 200 MB | 899288 | 206824505 | 37570839 |
| en_US.news.txt | 196 MB | 77259 | 15639408 | 2651432 |
| en_US.twitter.txt | 159 MB | 2360148 | 162096241 | 30451170 |
As the table above shows, Twitter is by far the largest file in terms of lines; this reflects the nature of the data, with each tweet limited to 140 characters. The data were then plotted as three histograms of the number of words per line. Twitter’s character limit is clearly visible in the plots below, as are the large outliers in the blogs and news files, with the blogs file containing lines of over 6,000 words.
library(ggplot2)    # qplot
library(gridExtra)  # grid.arrange
# Histograms of words per line
blog_count <- stri_count_words(blogs)
news_count <- stri_count_words(news)
twit_count <- stri_count_words(twitter)
plot1 <- qplot(blog_count,
               geom = "histogram",
               main = "Blogs",
               xlab = "",
               ylab = "Frequency",
               binwidth = 10)
plot2 <- qplot(news_count,
               geom = "histogram",
               main = "News",
               xlab = "",
               ylab = "Frequency",
               binwidth = 10)
plot3 <- qplot(twit_count,
               geom = "histogram",
               main = "Twitter",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 1)
plotList <- list(plot1, plot2, plot3)
do.call(grid.arrange, c(plotList, list(ncol = 1)))
Due to the large file sizes, a sample of the data is taken to reduce processing time: 20,000 lines are drawn from each file and combined into a single corpus. Profanity, punctuation, numbers, links/URLs, and symbols are then removed from the tokenised corpus.
library(quanteda)  # corpus, tokens, dictionary, stopwords
# Set seed for reproducibility
set.seed(245)
# Sample size per file
sample_size <- 20000
# Sample from each dataset
blog_sample <- sample(blogs, sample_size, replace = FALSE)
news_sample <- sample(news, sample_size, replace = FALSE)
twit_sample <- sample(twitter, sample_size, replace = FALSE)
# Combine into a single sample and build a corpus
combo_sample <- c(blog_sample, news_sample, twit_sample)
corpus_all <- corpus(combo_sample)
# Remove swear words using the lexicon package's profanity list
swearwords <- lexicon::profanity_alvarez
bad_dict <- dictionary(list(bad_words = swearwords))
Clean_all <- tokens_remove(tokens(corpus_all), bad_dict)
# Remove numbers, punctuation, links and symbols
Clean_all <- tokens(Clean_all, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE, remove_url = TRUE)
# Remove English stopwords
Clean_all_stop <- tokens_remove(Clean_all, pattern = stopwords('en'))
The graph below shows the top 25 words and their frequency in the sample. The word “said” is by far the most frequently used single word, followed by “one” and “just”:
library(dplyr)               # %>% pipe
library(quanteda.textstats)  # textstat_frequency
Clean_all_stop %>%
  tokens_tolower() %>%
  dfm() %>%
  textstat_frequency(n = 25) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = "Words in total sample", y = "Frequency of occurrence")
Below is a wordcloud of the 200 most frequent words in the sample.
library(wordcloud)
words <- textstat_frequency(dfm(Clean_all_stop, tolower = TRUE))
wordcloud(words$feature, words$frequency, max.words = 200)
(Note: several of the most frequent words, including “said”, “can”, “just”, “look” and “everything”, could not be fitted on the page and were omitted from the wordcloud.)
When assessing the n-grams of the dataset, the bigrams below are the most frequent word combinations, with “last year” and “right now” ranking highest. What is intriguing is the number of city references that appear among them; these names will need to be accounted for in any model.
Clean_all_stop %>%
  tokens_tolower() %>%
  tokens_ngrams(n = 2) %>%
  dfm() %>%
  textstat_frequency(n = 25) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = "2-grams in sample", y = "Frequency of occurrence")
## Uniqueness of Words
The analysis below is adapted from the work of “Jeff B” (credited in the code) and was initially not going to be included in this report; however, the process is important: it establishes the minimum size of dictionary required for an application to run smoothly and quickly. The table below shows that, with stopwords included, a dictionary of as few as 150 words covers over 50% of all word occurrences. Beyond that point, the number of unique words required grows rapidly, with little gain in the proportion of total words covered.
# credit: jgbond/capstone-milestone
uniqueNeeds <- function(dictLength) {
  # Frequency table for the full (stopwords-included) sample
  freqs <- Clean_all %>%
    dfm(tolower = TRUE) %>%
    textstat_frequency()
  data.frame(dictLength = dictLength,
             # Proportion of total word occurrences covered by the top N words
             totalCover = sum(freqs$frequency[1:dictLength]) / sum(freqs$frequency),
             # Proportion of unique words represented by the top N words
             uniqueCover = dictLength / nrow(freqs))
}
wordCoverTable <- rbind(uniqueNeeds(10), uniqueNeeds(100),
uniqueNeeds(150), uniqueNeeds(250), uniqueNeeds(500),
uniqueNeeds(1000), uniqueNeeds(2000), uniqueNeeds(3000),
uniqueNeeds(4000))
wordCoverPlot <- ggplot(data = wordCoverTable, aes(x = dictLength, y = totalCover)) +
  geom_line() +
  labs(title = "Word Coverage",
       x = "Top N Words",
       y = "Proportion of Total Words Covered by Top N Words")
wordCoverTable
## dictLength totalCover uniqueCover
## 1 10 0.2156526 0.0001261002
## 2 100 0.4622781 0.0012610022
## 3 150 0.5070787 0.0018915034
## 4 250 0.5573705 0.0031525056
## 5 500 0.6269458 0.0063050112
## 6 1000 0.7007134 0.0126100224
## 7 2000 0.7746455 0.0252200449
## 8 3000 0.8156671 0.0378300673
## 9 4000 0.8426271 0.0504400898
In the next steps, a model will be developed to predict the most likely next word from the preceding words, using the n-gram frequencies extracted from the sample above. The model can then be tested against the remaining, unsampled text.
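As a rough illustration of the direction this model is likely to take, below is a minimal sketch of a bigram-based next-word lookup built from the cleaned sample tokens above. The helper predict_next is illustrative only, and Clean_all (with stopwords retained, since stopwords will also need to be predicted) is assumed to be available from the cleaning step; the final model will need higher-order n-grams, smoothing and back-off to handle unseen word combinations.
# A minimal sketch, not the final model: bigram frequencies from the cleaned sample
bigram_freq <- Clean_all %>%
  tokens_tolower() %>%
  tokens_ngrams(n = 2, concatenator = " ") %>%
  dfm() %>%
  textstat_frequency()
# Illustrative helper: return the n words most frequently observed after 'word'
predict_next <- function(word, n = 3) {
  first  <- sub(" .*$", "", bigram_freq$feature)  # first word of each bigram
  second <- sub("^.* ", "", bigram_freq$feature)  # second word of each bigram
  head(second[first == tolower(word)], n)         # bigram_freq is already sorted by frequency
}
predict_next("right")  # "now" would be expected to rank highly, given the bigram plot above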