Introduction

The following report covers an initial assessment of the three text files provided, containing data from blogs, news articles and Twitter. The overall project calls for building a text-prediction application based on a sample of this data; this initial report consists of the following:

  1. Downloading and cleaning data
  2. Basic summary of the datasets
  3. Interesting findings of the datasets

Initial File Assessment

The function “readLines” is used to read in the three large files; each file is then assessed for its number of lines, characters, and words, and the results are collected in the summary table below.
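
Since the reading step itself is not shown in the report, a minimal sketch of it is included here; the package list is inferred from the functions used later in the code, and the file paths are assumptions (the unzipped en_US files are assumed to sit in the working directory):

# Packages assumed by the code in this report
library(stringi)             # stri_stats_latex, stri_count_words
library(knitr)               # kable
library(ggplot2)             # qplot, ggplot
library(gridExtra)           # grid.arrange
library(quanteda)            # corpus, tokens, dfm, dictionary
library(quanteda.textstats)  # textstat_frequency
library(wordcloud)           # wordcloud
library(lexicon)             # profanity_alvarez
library(magrittr)            # %>% pipe

# File names (assumed locations of the unzipped en_US files)
blogsFileName   <- "en_US.blogs.txt"
newsFileName    <- "en_US.news.txt"
twitterFileName <- "en_US.twitter.txt"

# Read each file line by line; skipNul avoids warnings about embedded NUL characters
blogs   <- readLines(blogsFileName, encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(newsFileName, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitterFileName, encoding = "UTF-8", skipNul = TRUE)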

# num lines per file
numLines <- sapply(list(blogs, news, twitter), length)
# num characters per file
numChars <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)
# num words per file
numWords <- sapply(list(blogs, news, twitter), stri_stats_latex)[4,]
# File size
fileSizeMB <- round(file.info(c(blogsFileName, newsFileName, twitterFileName))$size / 1024 ^ 2)
# words per line: min / mean / max for each file (not shown in the table below)
text_summary <- sapply(list(blogs, news, twitter),
                       function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(text_summary) <- c('Min', 'Mean', 'Max')

summary <- data.frame(
  File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
  FileSize = paste(fileSizeMB, "MB"),
  Lines = numLines,
  Characters = numChars,
  Words = numWords
)

kable(summary, format = "simple", row.names = FALSE)
File                 FileSize   Lines     Characters   Words
------------------   --------   -------   ----------   --------
en_US.blogs.txt      200 MB     899288    206824505    37570839
en_US.news.txt       196 MB     77259     15639408     2651432
en_US.twitter.txt    159 MB     2360148   162096241    30451170

As we can see from the table above, the Twitter file is by far the largest in terms of lines; this is due to the nature of the data, with each tweet limited to 140 characters. Thereafter the data was plotted as three histograms of the number of words per line. In the plots below, Twitter’s line limit is clearly visible. Also visible are the large outliers in line length in the blogs and news files, with the blogs file containing lines of over 6000 characters.

# histograms of words per line

blog_count <- stri_count_words(blogs)
news_count <- stri_count_words(news)
twit_count <- stri_count_words(twitter)


plot1 <- qplot(blog_count,
               geom = "histogram",
               main = "Blogs",
               xlab = "",
               ylab = "Frequency",
               binwidth = 10)
plot2 <- qplot(news_count,
               geom = "histogram",
               main = "News",
               xlab = "",
               ylab = "Frequency",
               binwidth = 10)
plot3 <- qplot(twit_count,
               geom = "histogram",
               main = "Twitter",
               xlab = "Words per Line",
               ylab = "Frequency",
               binwidth = 1)

plotList <- list(plot1, plot2, plot3)

do.call(grid.arrange, c(plotList, list(ncol = 1)))

Due to the large file sizes, a sample of the data is taken to reduce processing time: 20,000 lines are extracted from each file and combined into a single corpus. Thereafter swear words, punctuation, numbers, links/URLs, and symbols are removed from the corpus; stopwords are also removed for the frequency analysis that follows.

# set seed
set.seed(245)

# assign sample size
sample_size = 20000

# Sample from data: 
blog_sample <- sample(blogs, sample_size, replace = FALSE)
news_sample <- sample(news, sample_size, replace = FALSE)
twit_sample <- sample(twitter, sample_size, replace = FALSE)

combo_sample <- c(blog_sample, news_sample, twit_sample) # Combine into single sample 

corpus_all <- corpus(combo_sample)

# remove swear words 
swearwords <- lexicon::profanity_alvarez
bad_dict <- dictionary(list(bad_words = swearwords))
Clean_all <- tokens_remove(tokens(corpus_all), bad_dict)

# Remove numbers, punctuation, links and symbols: 
Clean_all <- tokens(Clean_all, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, remove_url = TRUE)

# Remove stopwords: 
Clean_all_stop <- tokens_remove(Clean_all,pattern = stopwords('en'))

Data Output

The graph below shows the top 25 words and their frequency in the sample. The word “said” is by far the most frequently used single word, followed by “one” and “just”:

Clean_all_stop %>%
  tokens_tolower() %>%
  dfm() %>%
  textstat_frequency(n = 25) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = "Words in total sample", y = "Frequency of occurrence")

Below is a wordcloud of the 200 most frequent words in the sample.

words <- textstat_frequency(dfm(Clean_all_stop, tolower = TRUE))
wordcloud(words$feature, words$frequency, max.words = 200)
(Note: the wordcloud function warned that several of the most frequent words, including “said”, “can”, “just”, “look” and “everything”, could not be fit on the page and were not plotted.)

When assessing the n-grams of the sample, the 2-grams below are the most frequent word combinations, with “last year” and “right now” ranking highest. What is intriguing is the number of city references that appear among them; these place names will need to be accounted for in any model.

Clean_all_stop2 <- tokens_tolower(Clean_all_stop)
Clean_all_stop3 <- tokens_ngrams(Clean_all_stop2, n = 2)

Clean_all_stop3 %>%
  dfm() %>%
  textstat_frequency(n = 25) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = "2-grams in sample", y = "Frequency of occurrence")

Uniqueness of Words

The analysis below is adapted from the work of “Jeff B” (jgbond/capstone-milestone) and was not initially going to be included in this report; however, the process is very important: establishing the minimum size of the dictionary required for an application to run smoothly and quickly. The results below show that, including stopwords, a dictionary of as few as 150 words accounts for over 50% of all words used. Beyond that point, the number of unique words grows with little gain in the percentage of total words covered.

# credit: jgbond/capstone-milestone

# Coverage achieved by a dictionary of the top-N most frequent words
uniqueNeeds <- function(dictLength) {
  Clean_all %>%
    dfm(tolower = TRUE) %>%
    textstat_frequency(n = dictLength) -> num
  
  Clean_all %>%
    dfm(tolower = TRUE) %>%
    textstat_frequency() -> den
  
  data.frame(dictLength = dictLength,
             totalCover = sum(num$frequency) / sum(den$frequency), # Percentage of total word count covered by top N words
             uniqueCover = dictLength / length(den$feature)) # Percentage of unique words covered by top N words
   
}

wordCoverTable <- rbind(uniqueNeeds(10), uniqueNeeds(100),
                        uniqueNeeds(150), uniqueNeeds(250), uniqueNeeds(500),
                        uniqueNeeds(1000), uniqueNeeds(2000), uniqueNeeds(3000), 
                        uniqueNeeds(4000))

wordCoverPlot <- ggplot(data = wordCoverTable, aes(x = dictLength, y = totalCover)) +
  geom_line() +
  labs(title = "Word Coverage",
       x = "Top N Words",
       y = "% Total Covered by Top N Words") +
  theme(legend.position="bottom")

wordCoverTable
##   dictLength totalCover  uniqueCover
## 1         10  0.2156526 0.0001261002
## 2        100  0.4622781 0.0012610022
## 3        150  0.5070787 0.0018915034
## 4        250  0.5573705 0.0031525056
## 5        500  0.6269458 0.0063050112
## 6       1000  0.7007134 0.0126100224
## 7       2000  0.7746455 0.0252200449
## 8       3000  0.8156671 0.0378300673
## 9       4000  0.8426271 0.0504400898
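
The coverage curve constructed above (wordCoverPlot) is assigned but not printed; it can be displayed by evaluating the object:

wordCoverPlot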

Conclusions:

In the next steps, a model will be developed to predict the most likely word chains for the texts above, using the sample extracted. Thereafter it can be tested on the remaining, unsampled text. A rough sketch of how the bigram frequencies from this report might feed such a prediction is given below.
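
As an illustration only (a hypothetical sketch, not the final model), the bigram frequencies produced above could already drive a simple next-word lookup; the helper predict_next and its behaviour are assumptions for illustration and not part of any package:

# Hypothetical sketch: a simple next-word lookup built from the bigram
# frequencies of the cleaned sample (Clean_all_stop3, created earlier)
bigram_freq <- textstat_frequency(dfm(Clean_all_stop3))

# predict_next() is an illustrative helper (not part of any package):
# it returns the most frequent words observed directly after the given word
predict_next <- function(word, n = 3) {
  parts  <- strsplit(bigram_freq$feature, "_", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  head(second[first == tolower(word)], n)
}

predict_next("right")  # "now" is expected among the top suggestions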

References: