We use HC Corpora data collected from English Twitter posts, blogs and news, with the goal of gaining insight into Natural Language Processing and eventually constructing an algorithm that predicts the next word during typing (e.g. auto-suggestions on mobile devices).
We will use the following files:
| File | Lines | Words |
|---|---|---|
| en_US.blogs.txt | 899,288 | 37,334,114 |
| en_US.news.txt | 1,010,242 | 34,365,936 |
| en_US.twitter.txt | 2,360,148 | 30,359,804 |
The first step is to acquire and load the data. We obtain the English text files from the Coursera project webpage and, in view of computational restrictions, work with a representative random subset of 10,000 lines drawn with the Linux `sort -R` command.
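For reference, the same sampling step can be sketched directly in R instead of the shell (this is an equivalent alternative, not the command we actually ran; it assumes the raw `en_US.twitter.txt` sits in the working directory, and the `set.seed` call is added only for reproducibility):

```r
# Draw a random 10,000-line subset of the Twitter file
set.seed(1)  # reproducibility only; not part of the original pipeline
lines <- readLines("en_US.twitter.txt")
writeLines(sample(lines, 10000), "test_file")
```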
We then load the sampled text using the following R code.
```r
# Representative subset of 10,000 lines saved in test_file
fileName <- "test_file"
# Create a file connection
con <- file(fileName, "r")
# Read all whitespace-separated tokens
d <- scan(con, character(0))
# Close the connection
close(con)
```
Next, we want to exclude profanity from the analysis. We have created a text file of common English profanity words, which we load into our R session.
con <- file("bad_words", "r")
bad <- scan(con, character(0))
close(con)
For the analysis we tokenize the text into words (excluding the profanity words), numbers and punctuation, and store these elements in three separate vectors.
```r
d <- toupper(d)
bad <- toupper(bad)
words <- c(); numbers <- c(); puncts <- c()
for (element in d) {
  # Letters only: the word part of the token
  word <- gsub("[^A-Z]+", "", element)
  if (nchar(word) > 0 && !(word %in% bad)) {
    words <- c(words, word)
  }
  # Digits only
  number <- gsub("[^0-9]+", "", element)
  if (nchar(number) > 0) {
    numbers <- c(numbers, number)
  }
  # Punctuation characters only
  pun <- gsub("[^[:punct:]]", "", element)
  if (nchar(pun) > 0) {
    puncts <- c(puncts, pun)
  }
}
```
An interesting question is: what are the ten most common words in the dataset?
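Such a count can be obtained with base R along the following lines (a sketch, using the `words` vector built above):

```r
# Ten most frequent words
head(sort(table(words), decreasing = TRUE), 10)
```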
| Word | Times |
|---|---|
| THE | 14751 |
| AND | 8623 |
| TO | 8462 |
| A | 7202 |
| OF | 6953 |
| I | 6180 |
| IN | 4672 |
| THAT | 3763 |
| IS | 3439 |
| IT | 3194 |
These are common English function words. Let us look at the most used “long” words (more than six letters):
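One way to produce this table (again a sketch over the same `words` vector):

```r
# Most frequent words longer than six letters
long <- words[nchar(words) > 6]
head(sort(table(long), decreasing = TRUE), 10)
```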
| Word | Times |
|---|---|
| BECAUSE | 467 |
| THROUGH | 282 |
| SOMETHING | 280 |
| ANOTHER | 247 |
| THOUGHT | 207 |
| WITHOUT | 155 |
| DIFFERENT | 153 |
| FRIENDS | 152 |
| GETTING | 148 |
| STARTED | 148 |
To understand how many words constitute the major part of the language, we plot a histogram of word frequencies.
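The histogram can be drawn roughly as follows (relative frequencies of the distinct words; the exact number of breaks is our choice):

```r
# Relative frequency of each distinct word
freq <- table(words) / length(words)
hist(as.numeric(freq), breaks = 50,
     main = "Word frequencies", xlab = "Relative frequency")
```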
We observe that most of the words have low (<5%) frequency.
The following graph shows how many distinct words are needed to cover a given percentage of the text. An interesting observation is that only about 200 words suffice to cover roughly 50% of a text, and about 1,000 words to cover 80%. We conclude that an average English text consists of about 20% rare words.
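A sketch of how the coverage curve can be computed:

```r
# Cumulative share of the text covered by the n most frequent words
freq <- sort(table(words), decreasing = TRUE)
coverage <- 100 * cumsum(freq) / sum(freq)
plot(coverage, type = "l", log = "x",
     xlab = "Number of distinct words", ylab = "Text coverage (%)")
which(coverage >= 50)[1]   # number of words needed for 50% coverage
```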
Many words tend to occur together, which means the following word can often be predicted with high probability. For instance, let us look at the most common word pairs.
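A sketch of one way to count adjacent word pairs (bigrams), assuming `words` preserves the original token order:

```r
# Build bigrams from consecutive words and count them
pairs <- paste(head(words, -1), tail(words, -1))
head(sort(table(pairs), decreasing = TRUE), 50)
```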
| Word pair | Times |
|---|---|
| OF THE | 1409 |
| IN THE | 1219 |
| TO THE | 679 |
| ON THE | 600 |
| TO BE | 532 |
| AND THE | 462 |
| FOR THE | 462 |
| AND I | 448 |
| AT THE | 403 |
| IT WAS | 402 |
| I HAVE | 396 |
| IT IS | 373 |
| I WAS | 373 |
| IN A | 371 |
| IS A | 352 |
| I AM | 325 |
| WITH A | 321 |
| THAT I | 319 |
| WITH THE | 312 |
| OF A | 281 |
| FROM THE | 271 |
| IF YOU | 257 |
| THAT THE | 241 |
| THIS IS | 239 |
| ONE OF | 235 |
| WILL BE | 230 |
| IS THE | 227 |
| AS A | 226 |
| FOR A | 226 |
| YOU CAN | 223 |
| BY THE | 220 |
| OF MY | 218 |
| I HAD | 216 |
| BUT I | 214 |
| WAS A | 203 |
| ALL THE | 202 |
| HAVE A | 201 |
| WHEN I | 197 |
| A FEW | 196 |
| AND A | 196 |
| I THINK | 193 |
| HAVE BEEN | 191 |
| THE SAME | 190 |
| TO DO | 190 |
| GOING TO | 188 |
| HAVE TO | 188 |
| WANT TO | 187 |
| OUT OF | 184 |
| TO GET | 183 |
| TO MAKE | 179 |
We can also explore collocations of longer words (more than 3 characters).
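A sketch under the assumption that short words are dropped before pairing, so that words separated only by short words become adjacent (this is consistent with HAVE BEEN rising from 191 above to 199 here):

```r
# Bigrams over words longer than three characters
long_words <- words[nchar(words) > 3]
long_pairs <- paste(head(long_words, -1), tail(long_words, -1))
head(sort(table(long_pairs), decreasing = TRUE), 50)
```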
| Word pair | Times |
|---|---|
| HAVE BEEN | 199 |
| MORE THAN | 114 |
| WOULD HAVE | 101 |
| THEY WERE | 100 |
| THEY HAVE | 86 |
| THAT THEY | 84 |
| THAT HAVE | 74 |
| THAT WOULD | 71 |
| KNOW THAT | 68 |
| THAT THIS | 60 |
| THAT WILL | 59 |
| THIS WEEK | 53 |
| THIS YEAR | 53 |
| THERE WERE | 52 |
| WHEN THEY | 52 |
| DONT KNOW | 46 |
| KNOW WHAT | 42 |
| THAT THERE | 42 |
| LIKE THAT | 41 |
| LIKE THIS | 41 |
| THINK THAT | 41 |
| DONT HAVE | 40 |
| EVEN THOUGH | 40 |
| WITH THIS | 40 |
| THIS TIME | 39 |
| WHAT THEY | 39 |
| ALONG WITH | 38 |
| FACT THAT | 38 |
| ABOUT THIS | 37 |
| BECAUSE THEY | 36 |
| COULD HAVE | 36 |
| WOULD LIKE | 36 |
| RATHER THAN | 35 |
| THEY WILL | 35 |
| WILL HAVE | 35 |
| ABOUT WHAT | 34 |
| THAT COULD | 34 |
| WITH THAT | 34 |
| EACH OTHER | 33 |
| LAST NIGHT | 33 |
| LAST WEEK | 33 |
| WITH SOME | 33 |
| WITH YOUR | 33 |
| FEEL LIKE | 32 |
| MUCH MORE | 32 |
| THEY WOULD | 32 |
| AWAY FROM | 31 |
| TALKING ABOUT | 30 |
| THAT JUST | 30 |
| THINGS THAT | 30 |
In a following publication we will present a prediction algorithm for the next word during typing.