The purpose of this document is to describe the findings of an exploratory analysis conducted on a set of corpora collected from publicly available sources by a web crawler. There are three different sources (blogs, Twitter and news) that have already been pre-cleaned and anonymized. In particular, this document focuses on the three files that contain blogs, news and tweets in English. Each line is a piece of text from a blog, a news site or a tweet, and each line is independent from the other lines in the file.
First of all, looking at the number of rows in each file, we can see that each file contains a very large number of lines:
## [1] "Number of rows in the Blog file: 899,288"
## [1] "Number of rows in the Twitter file: 2,360,148"
## [1] "Number of rows in the News file: 1,010,242"
We decided to use a simplified version of the sentences in the files: we removed from the three files all the sentences containing only one word, then we removed punctuation except the apostrophe, which in English is used in contractions and short forms, and finally we converted everything to lowercase. We know that in this way we are limiting the power of the model, because it won’t be able to correctly suggest first names, surnames, city names or important punctuation such as periods and commas, but this is enough for the purpose of this document. Furthermore, due to the large amount of data and to speed up the exploratory analysis, we decided to subsample the three files.
#Example of cleaning:
library(tidyverse)   #readr, dplyr and stringr are used below

tweet_ori <- read_lines(file = "./final/en_US/en_US.twitter.txt")
tweet <- tibble(text = tweet_ori) %>%
  #normalize curly apostrophes, replace all other punctuation with spaces, lowercase
  mutate(text = str_replace_all(text, pattern = c("’" = "'", "(?!')[[:punct:]]" = " ")) %>%
           str_squish() %>% tolower(),
         num_word = str_count(text, boundary("word"))) %>%
  #keep only sentences with more than one word
  filter(num_word > 1)
However, we first looked at the distribution of the number of words per sentence in the three files.
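These summaries can be obtained directly from the num_word column created during cleaning; a minimal sketch, assuming the cleaned tibbles are named blog, tweet and news:
#Word-count distribution per sentence in each cleaned file
summary(blog$num_word)
summary(tweet$num_word)
summary(news$num_word)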
## Summary of Blog sentences words after cleaning:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 9.0 29.0 42.3 60.0 6817.0
## Summary of Twitter sentences words after cleaning:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 7.00 12.00 12.84 18.00 47.00
## Summary of News sentences words after cleaning:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 19.00 32.00 34.73 46.00 1928.00
The same information is visualized in the following pictures:
The Twitter summary confirms that it is the file with the shortest sentences, due to the constraints of Twitter itself (the limited number of characters allowed in a single tweet).
The other two files contain some very long sentences. Here we report an extract of the two sentences with the highest number of words: one from the “blog” file and one from the “news” file. The first is a detailed description of the earthquakes that damaged the Fukushima reactors:
“update as of 11 30 a m edt monday april 11 no damage to japan’s nuclear power plants was reported today after another strong aftershock hit the northeast coast the temblor measured at magnitude 6 6 by the u s geological survey rocked the country one month after the magnitude 9 0 earthquake and tsuna”
The other, from the “news” file, is about US election results:
“democrats president of the u s barack obama 46 938 united states senator joe donnelly 42 265 governor john r gregg 41 598 u s representative district 5 tony long 3 236 scott reske 4 819 u s representative district 7 andre d carson 34 666 bob citizen kern 2 036 pierre quincy pullins 581 woodrow wilco”
This is a very uninformative sentence for our analysis (where we need to predict the next word in a sentence), because it is a sequence of first names and surnames (although some are very well known) followed by the number of votes they received. We might want to exclude such lines from further steps.
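As a possible refinement (not applied in this analysis), such extreme lines could be dropped by capping the sentence length at a high quantile of the word-count distribution; a sketch:
#Possible filter (not applied here): drop the longest 0.1% of sentences
news <- news %>%
  filter(num_word <= quantile(num_word, probs = 0.999))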
The number of words in a sentence, however, doesn’t describe the variety of words used; we need another measure that represents the richness of the vocabulary and on which to base the sample size calculation.
We used a priori information about the most frequent English words to calculate the sample size at a 95% (or higher) confidence level. There are many sources about the most frequent English words, but we decided to use the list shown at https://www.wordfrequency.info/free.asp?s=y, which is based on the Corpus of Contemporary American English. The most frequent words are, unsurprisingly, “the, and, to…”: prepositions, pronouns, articles and so on. We calculated the TTR (type-token ratio) of each of the 13 most frequent words suggested by the list and calculated the sample size from the TTR of the least frequent of them in each file. The subsampling process is a simple random sampling without replacement of the sentences in a file.
#Example of TTR calculation: count, in each sentence, the occurrences of the
#13 most frequent English words (delimited by spaces or by sentence boundaries)
news <- news %>%
  mutate(num_the = str_count(text, pattern = "^the | the | the$"),
         num_to = str_count(text, pattern = "^to | to | to$"),
         num_and = str_count(text, pattern = "^and | and | and$"),
         num_for = str_count(text, pattern = "^for | for | for$"),
         num_that = str_count(text, pattern = "^that | that | that$"),
         num_you = str_count(text, pattern = "^you | you | you$"),
         num_of = str_count(text, pattern = "^of | of | of$"),
         num_a = str_count(text, pattern = "^a | a | a$"),
         num_in = str_count(text, pattern = "^in | in | in$"),
         num_he = str_count(text, pattern = "^he | he | he$"),
         num_with = str_count(text, pattern = "^with | with | with$"),
         num_it = str_count(text, pattern = "^it | it | it$"),
         num_i = str_count(text, pattern = "^i | i | i$"))
#Sum each counter over the whole file and divide by the total number of
#words to obtain the TTR of each of the 13 words
sum_news <- news %>%
  summarise_at(.vars = vars(starts_with("num")), .funs = sum) %>%
  mutate_all(.funs = function(x, y) x / y, y = .$num_word) %>%
  transpose() %>%
  unlist()
#TTR of the least frequent word in the News file among the 13 most frequent English words.
sum_news[which.min(sum_news)]
## num_you
## 0.002739027
#TTR of the least frequent word in the Blog file among the 13 most frequent English words.
sum_blog[which.min(sum_blog)]
## num_he
## 0.003858661
#TTR of the least frequent word in the Twitter file among the 13 most frequent English words.
sum_tweet[which.min(sum_tweet)]
## num_he
## 0.00186901
From these values we calculated the minimum sample size for each of the three files, with a precision of 1/5 of the TTR value.
## [1] "Minimum sample size required for Blog file: 14,607"
## [1] "Minimum sample size required for Twitter file: 51,289"
## [1] "Minimum sample size required for News file: 34,968"
So we sampled 15,000 lines from the Blog file, 52,000 from Twitter and 35,000 from News.
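The sampling itself is a plain random draw of lines without replacement; a minimal sketch (object names and seed are illustrative):
#Simple random sampling of sentences without replacement
set.seed(1234)
blog_sample <- slice_sample(blog, n = 15000)
tweet_sample <- slice_sample(tweet, n = 52000)
news_sample <- slice_sample(news, n = 35000)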
Then the expected confidence intervals for the 3 TTRs are:
## [1] "Expected C.I. 95% for TTR of the word you in News file: [0.00219148-0.00328658]"
## [1] "Expected C.I. 95% for TTR of the word he in Blog file: [0.00524284-0.00782118]"
## [1] "Expected C.I. 95% for TTR of the word he in Twitter file: [0.00149777-0.00224025]"
## [1] "Observed TTR in News sample: 0.00272757"
## [1] "Observed TTR in Blog sample: 0.00387954"
## [1] "Observed TTR in Twitter sample: 0.00186854"
We created unigrams and bigrams from the three files. The following plots show the top 10 and the bottom 10 bigrams for each file.
As expected, we found that the bigrams with the highest frequency are word pairs such as conjunction + article, preposition + article, preposition + verb, and subject + verb. If we look at the trigrams, we can see that they are still mostly related to grammatical patterns or short complete sentences such as “I love you” in the Twitter file.
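One way such n-gram counts can be built is with the tidytext package; a minimal sketch on the Twitter sample (the sample name is illustrative):
library(tidytext)
#Bigram counts for the Twitter sample; change n for trigrams, 5-grams, etc.
tweet_bigrams <- tweet_sample %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)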
To start to see some interesting sequences of words, we needed to look at 5-grams, as shown here:
For this reason we considered implementing a predictive model that takes into account sequences of at least 4 or 5 words.